Paperid: 1, https://arxiv.org/pdf/2506.24019.pdf
Authors:Hongxin Zhang, Zheyuan Zhang, Zeyuan Wang, Zunzhe Zhang, Lixing Fang, Qinhong Zhou, Chuang Gan
Title: Ella: Embodied Social Agents with Lifelong Memory
Abstract:
We introduce Ella, an embodied social agent capable of lifelong learning within a community in a 3D open world, where agents accumulate experiences and acquire knowledge through everyday visual observations and social interactions. At the core of Ella's capabilities is a structured, long-term multimodal memory system that stores, updates, and retrieves information effectively. It consists of a name-centric semantic memory for organizing acquired knowledge and a spatiotemporal episodic memory for capturing multimodal experiences. By integrating this lifelong memory system with foundation models, Ella retrieves relevant information for decision-making, plans daily activities, builds social relationships, and evolves autonomously while coexisting with other intelligent beings in the open world. We conduct capability-oriented evaluations in a dynamic 3D open world where 15 agents engage in social activities for days and are assessed with a suite of unseen controlled evaluations. Experimental results show that Ella can influence, lead, and cooperate well with other agents to achieve goals, showcasing its ability to learn effectively through observation and social interaction. Our findings highlight the transformative potential of combining structured memory systems with foundation models for advancing embodied intelligence. More videos can be found at https://umass-embodied-agi.github.io/Ella/.
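
The abstract describes two complementary stores: a name-centric semantic memory and a spatiotemporal episodic memory. A minimal sketch of how such a split might be organized (all class and method names here are illustrative assumptions, not the paper's API):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """One multimodal experience: when, where, and what was observed."""
    time: float
    location: tuple          # e.g., (x, y) world coordinates
    description: str         # text rendering of the observation

class LifelongMemory:
    """Toy two-part memory: name-keyed facts plus a time-ordered episode log."""
    def __init__(self):
        self.semantic = {}   # name -> set of facts about that person/place
        self.episodic = []   # chronologically appended Episode records

    def observe(self, time, location, description, names=()):
        self.episodic.append(Episode(time, location, description))
        for name in names:
            self.semantic.setdefault(name, set()).add(description)

    def recall_about(self, name):
        """Name-centric lookup, as the semantic memory is described."""
        return sorted(self.semantic.get(name, ()))

    def recall_window(self, t0, t1):
        """Spatiotemporal lookup: episodes inside a time window."""
        return [e for e in self.episodic if t0 <= e.time <= t1]

mem = LifelongMemory()
mem.observe(9.0, (2, 3), "Met Bob at the cafe", names=["Bob"])
mem.observe(9.5, (2, 3), "Bob mentioned a town meeting tonight", names=["Bob"])
print(mem.recall_about("Bob"))
print(mem.recall_window(9.0, 9.2))
```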

Authors:Hyunjong Kim, Sangyeop Kim, Jongheon Jeong, Yeongjae Cho, Sungzoon Cho
Title: EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations
Abstract:
Recent advances in large language models and vision-language models have led to growing interest in explainable evaluation metrics for image captioning. However, these metrics generate explanations without standardized criteria, and the overall quality of the generated explanations remains unverified. In this paper, we propose EXPERT, a reference-free evaluation metric that provides structured explanations based on three fundamental criteria: fluency, relevance, and descriptiveness. By constructing large-scale datasets of high-quality structured explanations, we develop a two-stage evaluation template to effectively supervise a vision-language model for both scoring and explanation generation. EXPERT achieves state-of-the-art results on benchmark datasets while providing significantly higher-quality explanations than existing metrics, as validated through comprehensive human evaluation. Our code and datasets are available at https://github.com/hjkim811/EXPERT.
English: Recent progress in language models has spurred the need for explainable image captioning metrics, leading to the development of EXPERT, a reference-free system that provides structured evaluations based on fluency, relevance, and descriptiveness, achieving top performance and high-quality explanations.

Authors:JiaRu Wu, Mingwei Liu
Title: AutoEvoEval: An Automated Framework for Evolving Close-Ended LLM Evaluation Data
Abstract:
Large language models (LLMs) have shown remarkable performance on various tasks, but existing evaluation benchmarks are often static and insufficient to fully assess their robustness and generalization in realistic scenarios. Prior work using evolutionary or adversarial data augmentation has improved evaluation diversity but lacks systematic control over perturbation types and multi-step complexity, limiting comprehensive robustness analysis. To address these gaps, we propose AutoEvoEval, an evolution-based evaluation framework for close-ended tasks such as multi-choice question answering. AutoEvoEval introduces 22 interpretable atomic evolution operations and supports multi-round compositions, enabling controlled generation of diverse, challenging, and realistic test samples. We conduct extensive experiments addressing four research questions on a broad set of open- and closed-source LLMs. Our results show that atomic operations cause an average accuracy drop of 7.283%, with structure-disrupting or misleading semantic edits causing the largest declines. Model sensitivities vary significantly for the same perturbation, and combining multiple evolution steps amplifies adversarial effects by up to 52.932%. These findings suggest current benchmarks may overestimate true model generalization and emphasize the need for evolution-aware robustness evaluation. Code and resources are available at: https://github.com/SYSUSELab/AutoEvoEval.
English Summary: AutoEvoEval introduces an evolution-based framework with 22 atomic operations to systematically evaluate LLM robustness, revealing significant accuracy drops from perturbations and exposing overestimated generalization in current benchmarks.
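
The framework's core idea is composing atomic, interpretable perturbations over multiple rounds. A rough sketch with two stand-in operations (the paper defines 22; these two operations and the item format are assumptions for illustration):

```python
import random

def shuffle_options(item, rng):
    """Atomic op: permute answer options while tracking the gold label."""
    opts = item["options"][:]
    gold = opts[item["answer"]]
    rng.shuffle(opts)
    return {**item, "options": opts, "answer": opts.index(gold)}

def add_distractor(item, rng):
    """Atomic op: append a plausible-looking extra option."""
    return {**item, "options": item["options"] + ["None of the above"]}

def evolve(item, ops, rounds, seed=0):
    """Multi-round composition: apply randomly chosen atomic ops in sequence."""
    rng = random.Random(seed)
    for _ in range(rounds):
        item = rng.choice(ops)(item, rng)
    return item

q = {"question": "2+2=?", "options": ["3", "4", "5"], "answer": 1}
print(evolve(q, [shuffle_options, add_distractor], rounds=3))
```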

Authors:Junjie Zhang, Jingyi Xi, Zhuoyang Song, Junyu Lu, Yuhua Ke, Ting Sun, Yukun Yang, Jiaxing Zhang, Songxin Zhang, Zejian Xie
Title: L0: Reinforcement Learning to Become General Agents
Abstract:
Training large language models (LLMs) to act as autonomous agents for multi-turn, long-horizon tasks poses significant challenges in scalability and training efficiency. To address this, we introduce L-Zero (L0), a scalable, end-to-end training pipeline for general-purpose agents. Featuring a low-cost, extensible, and sandboxed concurrent agent worker pool, L0 lowers the barrier for applying reinforcement learning in complex environments. We also introduce NB-Agent, the agent scaffold within L0, which operates in a "code-as-action" fashion via a Read-Eval-Print-Loop (REPL). We evaluate L0 on factuality question-answering benchmarks. Our experiments demonstrate that a base model can develop robust problem-solving skills using solely Reinforcement Learning with Verifiable Rewards (RLVR). On the Qwen2.5-7B-Instruct model, our method boosts accuracy on SimpleQA from 30% to 80% and on HotpotQA from 22% to 41%. We have open-sourced the entire L0 system, including our L0 series models, the NB-Agent, a complete training pipeline, and the corresponding training recipes at https://github.com/cmriat/l0.
English Summary: The L-Zero (L0) system introduces a scalable training pipeline that enables large language models to develop robust problem-solving skills through reinforcement learning, significantly improving accuracy on factuality benchmarks like SimpleQA and HotpotQA.
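
NB-Agent acts in a "code-as-action" fashion through a Read-Eval-Print-Loop. A bare-bones sketch of such a loop (the action format and stopping rule are assumptions; the real scaffold runs inside a sandboxed concurrent worker pool):

```python
import io, contextlib

def run_action(code, namespace):
    """Execute one model-emitted code action and capture its printed output."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, namespace)  # real systems sandbox this step
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue()

def repl_episode(policy, task, max_turns=5):
    """Alternate model 'actions' (code strings) with REPL observations."""
    namespace, transcript = {}, [f"Task: {task}"]
    for _ in range(max_turns):
        action = policy(transcript)          # LLM call in the real pipeline
        if action.strip() == "DONE":
            break
        transcript.append("Observation: " + run_action(action, namespace))
    return transcript

# Dummy policy: computes, then stops.
steps = iter(["print(21 * 2)", "DONE"])
print(repl_episode(lambda t: next(steps), "Compute 21*2"))
```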

Authors:Arnisa Fazla, Lucas Krauter, David Guzman Piedrahita, Andrianos Michail
Title: Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack
Abstract:
We extend BeamAttack, an adversarial attack algorithm designed to evaluate the robustness of text classification systems through word-level modifications guided by beam search. Our extensions include support for word deletions and the option to skip substitutions, enabling the discovery of minimal modifications that alter model predictions. We also integrate LIME to better prioritize word replacements. Evaluated across multiple datasets and victim models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA framework, our approach achieves over a 99% attack success rate while preserving the semantic and lexical similarity of the original texts. Through both quantitative and qualitative analysis, we highlight BeamAttack's effectiveness and its limitations. Our implementation is available at https://github.com/LucK1Y/BeamAttack.
English Summary: The study enhances BeamAttack by incorporating word deletions and LIME-guided substitutions, achieving over 99% attack success on text classifiers while maintaining text similarity, as validated through comprehensive evaluations.
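
A compact sketch of the beam-search core: each beam state is a modified word sequence, candidate edits are substitutions or deletions, and states are ranked by how much they lower the victim model's confidence in the true label (the victim model and substitution list below are toy stand-ins; LIME saliency would additionally rank which positions to edit first):

```python
def beam_attack(words, true_prob, subs, beam_width=3, max_edits=2):
    """Search word-level edits (substitute or delete) that most reduce
    the victim's probability for the true label."""
    beam = [(true_prob(words), words)]
    for _ in range(max_edits):
        candidates = []
        for _, state in beam:
            for i in range(len(state)):
                # word deletion
                deleted = state[:i] + state[i + 1:]
                candidates.append((true_prob(deleted), deleted))
                # word substitutions
                for sub in subs.get(state[i], []):
                    new = state[:i] + [sub] + state[i + 1:]
                    candidates.append((true_prob(new), new))
        beam = sorted(candidates + beam, key=lambda c: c[0])[:beam_width]
        if beam[0][0] < 0.5:        # prediction flipped: minimal edit found
            break
    return beam[0]

# Toy victim: confidence in "misinformation" grows with alarmist words.
victim = lambda ws: min(1.0, 0.2 + 0.4 * sum(w in {"shocking", "hoax"} for w in ws))
text = "shocking hoax about vaccines".split()
print(beam_attack(text, victim, {"shocking": ["surprising"], "hoax": ["claim"]}))
```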

Authors:Huanjin Yao, Jiaxing Huang, Yawen Qiu, Michael K. Chen, Wenzheng Liu, Wei Zhang, Wenjie Zeng, Xikun Zhang, Jingyi Zhang, Yuxin Song, Wenhao Wu, Dacheng Tao
Title: MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
Abstract:
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an open-ended format and filtered using a multi-model voting technique to eliminate shortcut cases related to guessing and memorization, ensuring robust reasoning evaluations. Third, we annotate the questions with detailed step-by-step solutions, and design a reference-based ternary scoring mechanism to reliably assess intermediate reasoning steps. With MMReason, we benchmark popular leading MLLMs and provide an in-depth analysis of their reasoning capabilities. We hope MMReason will serve as a valuable resource for advancing MLLM reasoning research. Code will be available at https://github.com/HJYao00/MMReason.
English: The MMReason benchmark is introduced to address the limitations of existing MLLM evaluations by providing diverse, challenging questions that require multi-step reasoning, incorporating detailed solutions and a ternary scoring mechanism to accurately assess long-chain reasoning capabilities.

Authors:Haocheng Yu, Yaxiong Wu, Hao Wang, Wei Guo, Yong Liu, Yawen Li, Yuyang Ye, Junping Du, Enhong Chen
Title: Thought-Augmented Planning for LLM-Powered Interactive Recommender Agent
Abstract:
Interactive recommendation is a typical information-seeking task that allows users to interactively express their needs through natural language and obtain personalized recommendations. Large language model-powered (LLM-powered) agents have become a new paradigm in interactive recommendations, effectively capturing users' real-time needs and enhancing personalized experiences. However, due to limited planning and generalization capabilities, existing formulations of LLM-powered interactive recommender agents struggle to effectively address diverse and complex user intents, such as intuitive, unrefined, or occasionally ambiguous requests. To tackle this challenge, we propose a novel thought-augmented interactive recommender agent system (TAIRA) that addresses complex user intents through distilled thought patterns. Specifically, TAIRA is designed as an LLM-powered multi-agent system featuring a manager agent that orchestrates recommendation tasks by decomposing user needs and planning subtasks, with its planning capacity strengthened through Thought Pattern Distillation (TPD), a thought-augmentation method that extracts high-level thoughts from the agent's and human experts' experiences. Moreover, we design a set of user simulation schemes to generate personalized queries of different difficulties and evaluate the recommendations based on specific datasets. Through comprehensive experiments conducted across multiple datasets, TAIRA exhibits significantly enhanced performance compared to existing methods. Notably, TAIRA shows a greater advantage on more challenging tasks while generalizing effectively on novel tasks, further validating its superiority in managing complex user intents within interactive recommendation systems. The code is publicly available at: https://github.com/Alcein/TAIRA.
English: The proposed TAIRA system enhances interactive recommendation by using a multi-agent framework with thought pattern distillation to effectively address complex user intents, demonstrating superior performance across diverse datasets.

Authors:WonJune Jang
Title: What to Keep and What to Drop: Adaptive Table Filtering Framework
Abstract:
Large language models (LLMs) for table-based reasoning often struggle with large tables due to input length limits. We propose ATF (Adaptive Table Filtering Framework), a modular and question-aware filtering pipeline that prunes uninformative columns and rows using LLM-generated column descriptions, clustering, and sparse-dense alignment scores. ATF integrates seamlessly with existing models (e.g., TAPAS, TAPEX) without retraining. Experiments show that ATF reduces table cells by 70%, boosting performance on out-of-domain TableQA tasks while causing slight performance drops on Table Fact Verification, where full-table context is more critical. These results highlight ATF's ability to adaptively balance informativeness and minimalism across tasks. Our code is available at: https://github.com/torijune/ATF-Adaptive-Table-Filtering-Framework
English: The ATF framework adaptively filters uninformative columns and rows from large tables using LLM-generated descriptions and alignment scores, reducing table size by 70% while improving performance on TableQA tasks but slightly decreasing accuracy in Table Fact Verification where complete context is essential.
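
The filtering idea, scoring each column's relevance to the question with a blend of sparse (lexical) and dense (embedding) signals and keeping only the top columns, can be sketched as follows (the toy embedding function and the 50/50 blend are placeholder assumptions, not the paper's components):

```python
import math

def sparse_score(question, column_text):
    """Lexical overlap between question tokens and column name/values."""
    q, c = set(question.lower().split()), set(column_text.lower().split())
    return len(q & c) / max(len(q), 1)

def dense_score(question, column_text, embed):
    """Cosine similarity between embeddings (embed() is a stand-in)."""
    u, v = embed(question), embed(column_text)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_columns(question, table, embed, keep=2, alpha=0.5):
    """Rank columns by a sparse-dense blend and prune the rest."""
    scored = sorted(
        table,
        key=lambda col: -(alpha * sparse_score(question, col + " " + " ".join(table[col]))
                          + (1 - alpha) * dense_score(question, col, embed)),
    )
    return {col: table[col] for col in scored[:keep]}

# Toy embedding: character-frequency vector (real systems use a text encoder).
embed = lambda s: [s.lower().count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]
table = {"city": ["Paris", "Rome"], "population": ["2.1M", "2.8M"], "mayor": ["A", "B"]}
print(filter_columns("what is the population of Paris", table, embed))
```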

Authors:Jiale Zhang, Zichong Wang, Avash Palikhe, Zhipeng Yin, Wenbin Zhang
Title: Datasets for Fairness in Language Models: An In-Depth Survey
Abstract:
Despite the growing reliance on fairness benchmarks to evaluate language models, the datasets that underpin these benchmarks remain critically underexamined. This survey addresses that overlooked foundation by offering a comprehensive analysis of the most widely used fairness datasets in language model research. To ground this analysis, we characterize each dataset across key dimensions, including provenance, demographic scope, annotation design, and intended use, revealing the assumptions and limitations baked into current evaluation practices. Building on this foundation, we propose a unified evaluation framework that surfaces consistent patterns of demographic disparities across benchmarks and scoring metrics. Applying this framework to sixteen popular datasets, we uncover overlooked biases that may distort conclusions about model fairness and offer guidance on selecting, combining, and interpreting these resources more effectively and responsibly. Our findings highlight an urgent need for new benchmarks that capture a broader range of social contexts and fairness notions. To support future research, we release all data, code, and results at https://github.com/vanbanTruong/Fairness-in-Large-Language-Models/tree/main/datasets, fostering transparency and reproducibility in the evaluation of language model fairness.
English: This survey critically analyzes widely used fairness datasets for language models, revealing embedded biases and proposing a unified evaluation framework that uncovers demographic disparities while advocating for more comprehensive benchmarks.

Authors:Paige Tuttösí, H. Henny Yeung, Yue Wang, Jean-Julien Aucouturier, Angelica Lim
Title: You Sound a Little Tense: L2 Tailored Clear TTS Using Durational Vowel Properties
Abstract:
We present the first text-to-speech (TTS) system tailored to second language (L2) speakers. We use duration differences between American English tense (longer) and lax (shorter) vowels to create a "clarity mode" for Matcha-TTS. Our perception studies showed that French-L1, English-L2 listeners had fewer (at least 9.15%) transcription errors when using our clarity mode, and found it more encouraging and respectful than overall slowed down speech. Remarkably, listeners were not aware of these effects: despite the decreased word error rate in clarity mode, listeners still believed that slowing all target words was the most intelligible, suggesting that actual intelligibility does not correlate with perceived intelligibility. Additionally, we found that Whisper-ASR did not use the same cues as L2 speakers to differentiate difficult vowels and is not sufficient to assess the intelligibility of TTS systems for these individuals.
English: This study introduces a specialized text-to-speech system for second language learners, featuring a clarity mode that enhances vowel duration contrast to significantly reduce transcription errors and improve perceived encouragement, while revealing a disconnect between actual and perceived intelligibility and limitations of automated speech recognition for this population.

Authors:David Guzman Piedrahita, Yongjin Yang, Mrinmaya Sachan, Giorgia Ramponi, Bernhard Schölkopf, Zhijing Jin
Title: Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games
Abstract:
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding their cooperation and social mechanisms is becoming increasingly important. In particular, how LLMs balance self-interest and collective well-being is a critical challenge for ensuring alignment, robustness, and safe deployment. In this paper, we examine the challenge of costly sanctioning in multi-agent LLM systems, where an agent must decide whether to invest its own resources to incentivize cooperation or penalize defection. To study this, we adapt a public goods game with institutional choice from behavioral economics, allowing us to observe how different LLMs navigate social dilemmas over repeated interactions. Our analysis reveals four distinct behavioral patterns among models: some consistently establish and sustain high levels of cooperation, others fluctuate between engagement and disengagement, some gradually decline in cooperative behavior over time, and others rigidly follow fixed strategies regardless of outcomes. Surprisingly, we find that reasoning LLMs, such as the o1 series, struggle significantly with cooperation, whereas some traditional LLMs consistently achieve high levels of cooperation. These findings suggest that the current approach to improving LLMs, which focuses on enhancing their reasoning capabilities, does not necessarily lead to cooperation, providing valuable insights for deploying LLM agents in environments that require sustained collaboration. Our code is available at https://github.com/davidguzmanp/SanctSim
English: This study investigates how large language models (LLMs) balance self-interest and collective welfare in multi-agent systems, revealing that enhanced reasoning capabilities do not necessarily foster cooperation, with some traditional models outperforming reasoning-focused ones in sustaining collaborative behavior.

Authors:Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li
Title: UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
Abstract:
Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce UrbanLLaVA, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In UrbanLLaVA, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of UrbanLLaVA across diverse urban tasks. Finally, we also extend an existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks and shows robust generalization abilities across cities. Source codes and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
English Summary: UrbanLLaVA is a multi-modal large language model designed to process diverse urban data types simultaneously, outperforming existing models across various urban tasks through a curated dataset and multi-stage training framework.

Authors:Gabriel Iturra-Bocaz, Felipe Bravo-Marquez
Title: RiverText: A Python Library for Training and Evaluating Incremental Word Embeddings from Text Data Streams
Abstract:
Word embeddings have become essential components in various information retrieval and natural language processing tasks, such as ranking, document classification, and question answering. However, despite their widespread use, traditional word embedding models present a limitation in their static nature, which hampers their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To overcome this problem, incremental word embedding algorithms are introduced, capable of dynamically updating word representations in response to new language patterns and processing continuous data streams. This paper presents RiverText, a Python library for training and evaluating incremental word embeddings from text data streams. Our tool is a resource for the information retrieval and natural language processing communities that work with word embeddings in streaming scenarios, such as analyzing social media. The library implements different incremental word embedding techniques, such as Skip-gram, Continuous Bag of Words, and Word Context Matrix, in a standardized framework. In addition, it uses PyTorch as its backend for neural network training. We have implemented a module that adapts existing intrinsic static word embedding evaluation tasks for word similarity and word categorization to a streaming setting. Finally, we compare the implemented methods with different hyperparameter settings and discuss the results. Our open-source library is available at https://github.com/dccuchile/rivertext.
English Summary: RiverText is a Python library designed for training and evaluating incremental word embeddings from text streams, addressing the limitations of static models by dynamically updating word representations using techniques like Skip-gram and CBOW within a PyTorch framework.

Authors:Siyuan Li, Ruitong Liu, Yan Wen, Te Sun, Andi Zhang, Yanbiao Ma, Xiaoshuai Hao
Title: Flow-Modulated Scoring for Semantic-Aware Knowledge Graph Completion
Abstract:
Knowledge graph completion demands effective modeling of multifaceted semantic relationships between entities. Yet prevailing methods, which rely on static scoring functions over learned embeddings, struggle to simultaneously capture rich semantic context and the dynamic nature of relations. To overcome this limitation, we propose the Flow-Modulated Scoring (FMS) framework, conceptualizing a relation as a dynamic evolutionary process governed by its static semantic environment. FMS operates in two stages: it first learns context-aware entity embeddings via a Semantic Context Learning module, and then models a dynamic flow between them using a Conditional Flow-Matching module. This learned flow dynamically modulates a base static score for the entity pair. By unifying context-rich static representations with a conditioned dynamic flow, FMS achieves a more comprehensive understanding of relational semantics. Extensive experiments demonstrate that FMS establishes a new state of the art across both canonical knowledge graph completion tasks: relation prediction and entity prediction. On the standard relation prediction benchmark FB15k-237, FMS achieves a near-perfect MRR of 99.8% and Hits@1 of 99.7% using a mere 0.35M parameters, while also attaining a 99.9% MRR on WN18RR. Its dominance extends to entity prediction, where it secures a 25.2% relative MRR gain in the transductive setting and substantially outperforms all baselines in challenging inductive settings. By unifying a dynamic flow mechanism with rich static contexts, FMS offers a highly effective and parameter-efficient new paradigm for knowledge graph completion. Code published at: https://github.com/yuanwuyuan9/FMS.
English: The proposed Flow-Modulated Scoring (FMS) framework enhances knowledge graph completion by integrating context-aware static embeddings with a dynamic flow mechanism, achieving state-of-the-art performance in both relation and entity prediction tasks with high parameter efficiency.
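
Schematically, FMS modulates a static triple score with a learned dynamic quantity derived from the entity pair. A toy rendering of that two-stage shape (using DistMult as the base scorer and a sigmoid gate standing in for the conditional flow module; both are assumptions, not the paper's exact components):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h, r, t = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
W = rng.normal(size=(d, 2 * d))          # stand-in weights for the flow module

def static_score(h, r, t):
    """Base static score over context-aware embeddings (DistMult form)."""
    return float(np.sum(h * r * t))

def flow_modulation(h, t):
    """Stand-in for the conditional flow: a gate computed from the pair."""
    v = W @ np.concatenate([h, t])       # 'flow' features for (h, t)
    return float(1 / (1 + np.exp(-v.mean())))

def fms_score(h, r, t):
    """Dynamic flow modulates the base static score."""
    return static_score(h, r, t) * flow_modulation(h, t)

print(fms_score(h, r, t))
```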

Authors:Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou
Title: MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
Abstract:
Multimodal embedding models, built upon causal Vision Language Models (VLMs), have shown promise in various tasks. However, current approaches face three key limitations: the use of causal attention in VLM backbones is suboptimal for embedding tasks; scalability issues due to reliance on high-quality labeled paired data for contrastive learning; and limited diversity in training objectives and data. To address these issues, we propose MoCa, a two-stage framework for transforming pre-trained VLMs into effective bidirectional multimodal embedding models. The first stage, Modality-aware Continual Pre-training, introduces a joint reconstruction objective that simultaneously denoises interleaved text and image inputs, enhancing bidirectional context-aware reasoning. The second stage, Heterogeneous Contrastive Fine-tuning, leverages diverse, semantically rich multimodal data beyond simple image-caption pairs to enhance generalization and alignment. Our method addresses the stated limitations by introducing bidirectional attention through continual pre-training, scaling effectively with massive unlabeled datasets via joint reconstruction objectives, and utilizing diverse multimodal data for enhanced representation robustness. Experiments demonstrate that MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results, and exhibits strong scalability with both model size and training data on MMEB.
English Summary: The proposed MoCa framework transforms pre-trained Vision Language Models into bidirectional multimodal embedding models through modality-aware continual pre-training and heterogeneous contrastive fine-tuning, achieving state-of-the-art performance on benchmarks while addressing limitations in attention mechanisms, data scalability, and training diversity.

Authors:Zhengren Wang, Bozhou Li, Dongwen Yao, Wentao Zhang
Title: Text2VectorSQL: Bridging Text-to-SQL and Vector Search for Unified Natural Language Queries
Abstract:
While Text-to-SQL enables natural language interaction with structured databases, its effectiveness diminishes with unstructured data or ambiguous queries due to rigid syntax and limited expressiveness. Concurrently, vector search has emerged as a powerful paradigm for semantic retrieval, particularly for unstructured data. However, existing VectorSQL implementations still rely heavily on manual crafting and lack tailored evaluation frameworks, leaving a significant gap between theoretical potential and practical deployment. To bridge these complementary paradigms, we introduce Text2VectorSQL, a novel framework unifying Text-to-SQL and vector search to overcome expressiveness constraints and support more diverse and holistic natural language queries. Specifically, Text2VectorSQL enables semantic filtering, multi-modal matching, and retrieval acceleration. For evaluation, we build vector indexes on appropriate columns, extend user queries with semantic search, and annotate ground truths via an automatic pipeline with expert review. Furthermore, we develop dedicated Text2VectorSQL models with synthetic data, demonstrating significant performance improvements over baseline methods. Our work establishes the foundation for the Text2VectorSQL task, paving the way for more versatile and intuitive database interfaces. The repository will be publicly available at https://github.com/Open-DataFlow/Text2VectorSQL.
English: Text2VectorSQL is a novel framework that integrates Text-to-SQL and vector search to enhance query expressiveness and support diverse natural language interactions, demonstrating significant performance improvements through tailored models and evaluation methods.
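
The target output couples ordinary SQL predicates with a vector-similarity clause. An illustrative natural-language-to-VectorSQL pair, assuming a pgvector-style `<->` distance operator and an `embedding` column (the schema and query are invented for illustration, not taken from the paper):

```python
# NL: "affordable laptops that feel good for programming"
# Structured part -> exact SQL filter; fuzzy part -> vector search over descriptions.
query_embedding = "[0.12, -0.03, ...]"   # produced by a text encoder (elided)

vector_sql = """
SELECT name, price
FROM products
WHERE category = 'laptop' AND price < 800        -- exact Text-to-SQL predicates
ORDER BY embedding <-> %(q)s                     -- semantic vector match
LIMIT 5;
"""
print(vector_sql % {"q": f"'{query_embedding}'"})
```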

Authors:Xiang Zhuang, Bin Wu, Jiyu Cui, Kehua Feng, Xiaotong Li, Huabin Xing, Keyan Ding, Qiang Zhang, Huajun Chen
Title: Boosting LLM's Molecular Structure Elucidation with Knowledge Enhanced Tree Search Reasoning
Abstract:
Molecular structure elucidation involves deducing a molecule's structure from various types of spectral data, which is crucial in chemical experimental analysis. While large language models (LLMs) have shown remarkable proficiency in analyzing and reasoning through complex tasks, they still encounter substantial challenges in molecular structure elucidation. We identify that these challenges largely stem from LLMs' limited grasp of specialized chemical knowledge. In this work, we introduce a Knowledge-enhanced reasoning framework for Molecular Structure Elucidation (K-MSE), leveraging Monte Carlo Tree Search for test-time scaling as a plugin. Specifically, we construct an external molecular substructure knowledge base to extend the LLMs' coverage of the chemical structure space. Furthermore, we design a specialized molecule-spectrum scorer to act as a reward model for the reasoning process, addressing the issue of inaccurate solution evaluation in LLMs. Experimental results show that our approach significantly boosts performance, particularly gaining more than 20% improvement on both GPT-4o-mini and GPT-4o. Our code is available at https://github.com/HICAI-ZJU/K-MSE.
English: This paper introduces K-MSE, a knowledge-enhanced framework that significantly improves molecular structure elucidation in large language models by integrating external chemical knowledge and a specialized reward mechanism, achieving over 20% performance gains on GPT-4 models.

Authors:Yida Zhao, Hao Xve, Xiang Hu, Kewei Tu
Title: A Systematic Study of Compositional Syntactic Transformer Language Models
Abstract:
Syntactic language models (SLMs) enhance Transformers by incorporating syntactic biases through the modeling of linearized syntactic parse trees alongside surface sentences. This paper focuses on compositional SLMs that are based on constituency parse trees and contain explicit bottom-up composition of constituent representations. We identify key aspects of design choices in existing compositional SLMs and propose a unified framework encompassing both existing models and novel variants. We conduct a comprehensive empirical evaluation of all the variants in our framework across language modeling, syntactic generalization, summarization, dialogue, and inference efficiency. Based on the experimental results, we make multiple recommendations on the design of compositional SLMs. Our code is released at https://github.com/zhaoyd1/compositional_SLMs.
English Summary: This paper introduces a unified framework for compositional syntactic language models that incorporate constituency parse trees, evaluates their performance across various tasks, and provides design recommendations based on comprehensive experiments.
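
The defining ingredient of these models is explicit bottom-up composition: each constituent's vector is built from its children's vectors. A minimal recursive version over a bracketed tree (the single-matrix tanh composition function is one of the simplest choices in this design space, used here only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
vocab = {w: rng.normal(size=d) for w in ["the", "cat", "sat"]}
W = rng.normal(size=(d, 2 * d)) / np.sqrt(2 * d)   # composition weights

def compose(tree):
    """Bottom-up composition: a leaf is a word vector; an internal node
    is a nonlinear function of its children's composed vectors."""
    if isinstance(tree, str):
        return vocab[tree]
    left, right = map(compose, tree)
    return np.tanh(W @ np.concatenate([left, right]))

# (the cat) sat  ->  nested tuples stand in for a constituency parse
print(compose((("the", "cat"), "sat")))
```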

Authors:Asen Dotsinski, Udit Thakur, Marko Ivanov, Mohammad Hafeez Khan, Maria Heuss
Title: On the Generalizability of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals"
Abstract:
We present a reproduction study of "Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals" (Ortu et al., 2024), which investigates competition of mechanisms in language models between factual recall and counterfactual in-context repetition. Our study successfully reproduces their primary findings regarding the localization of factual and counterfactual information, the dominance of attention blocks in mechanism competition, and the specialization of attention heads in handling competing information. We reproduce their results on both GPT-2 (Radford et al., 2019) and Pythia 6.9B (Biderman et al., 2023). We extend their work in three significant directions. First, we explore the generalizability of these findings to even larger models by replicating the experiments on Llama 3.1 8B (Grattafiori et al., 2024), discovering greatly reduced attention head specialization. Second, we investigate the impact of prompt structure by introducing variations where we avoid repeating the counterfactual statement verbatim or we change the premise word, observing a marked decrease in the logit for the counterfactual token. Finally, we test the validity of the authors' claims for prompts of specific domains, discovering that certain categories of prompts skew the results by providing the factual prediction token as part of the subject of the sentence. Overall, we find that the attention head ablation proposed in Ortu et al. (2024) is ineffective for domains that are underrepresented in their dataset, and that the effectiveness varies based on model architecture, prompt structure, domain and task.
English: This reproduction study confirms Ortu et al.'s findings about attention mechanisms handling factual and counterfactual information in language models, while extending the research to reveal limitations in attention head specialization across larger models, sensitivity to prompt structures, and domain-specific effectiveness of their proposed ablation method.

Authors:Younwoo Choi, Changling Li, Yongjin Yang, Zhijing Jin
Title: Agent-to-Agent Theory of Mind: Testing Interlocutor Awareness among Large Language Models
Abstract:
As large language models (LLMs) are increasingly integrated into multi-agent and human-AI systems, understanding their awareness of both self-context and conversational partners is essential for ensuring reliable performance and robust safety. While prior work has extensively studied situational awareness which refers to an LLM's ability to recognize its operating phase and constraints, it has largely overlooked the complementary capacity to identify and adapt to the identity and characteristics of a dialogue partner. In this paper, we formalize this latter capability as interlocutor awareness and present the first systematic evaluation of its emergence in contemporary LLMs. We examine interlocutor inference across three dimensions-reasoning patterns, linguistic style, and alignment preferences-and show that LLMs reliably identify same-family peers and certain prominent model families, such as GPT and Claude. To demonstrate its practical significance, we develop three case studies in which interlocutor awareness both enhances multi-LLM collaboration through prompt adaptation and introduces new alignment and safety vulnerabilities, including reward-hacking behaviors and increased jailbreak susceptibility. Our findings highlight the dual promise and peril of identity-sensitive behavior in LLMs, underscoring the need for further understanding of interlocutor awareness and new safeguards in multi-agent deployments. Our code is open-sourced at https://github.com/younwoochoi/InterlocutorAwarenessLLM.
English: This study introduces interlocutor awareness as a critical capability for large language models (LLMs) to identify and adapt to dialogue partners, demonstrating its dual role in enhancing multi-agent collaboration while introducing new safety vulnerabilities like reward hacking and jailbreak risks.

Authors:Mai A. Shaaban, Tausifa Jan Saleem, Vijay Ram Papineni, Mohammad Yaqub
Title: MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering
Abstract:
Medical visual question answering (MedVQA) plays a vital role in clinical decision-making by providing contextually rich answers to image-based queries. Although vision-language models (VLMs) are widely used for this task, they often generate factually incorrect answers. Retrieval-augmented generation addresses this challenge by providing information from external sources, but risks retrieving irrelevant context, which can degrade the reasoning capabilities of VLMs. Re-ranking retrievals, as introduced in existing approaches, enhances retrieval relevance by focusing on query-text alignment. However, these approaches neglect the visual or multimodal context, which is particularly crucial for medical diagnosis. We propose MOTOR, a novel multimodal retrieval and re-ranking approach that leverages grounded captions and optimal transport. It captures the underlying relationships between the query and the retrieved context based on textual and visual information. Consequently, our approach identifies more clinically relevant contexts to augment the VLM input. Empirical analysis and human expert evaluation demonstrate that MOTOR achieves higher accuracy on MedVQA datasets, outperforming state-of-the-art methods by an average of 6.45%. Code is available at https://github.com/BioMedIA-MBZUAI/MOTOR.
English: MOTOR introduces a multimodal retrieval and re-ranking method that integrates visual and textual information through grounded captions and optimal transport, significantly enhancing the relevance of retrieved contexts for medical visual question answering and achieving a 6.45% average accuracy improvement over existing methods.
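
The re-ranking step matches query and candidate tokens under an optimal-transport plan. A small Sinkhorn-based relevance score (uniform marginals, cosine cost, and toy embeddings are assumptions; MOTOR additionally uses grounded captions, which this sketch omits):

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.1, iters=200):
    """Entropy-regularized OT with uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    K = np.exp(-cost / reg)
    a, b = np.ones(n) / n, np.ones(m) / m
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)

def ot_relevance(query_emb, cand_emb):
    """Score a candidate context: transport mass between normalized token
    embeddings and sum the similarity carried by the plan."""
    Q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    C = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = Q @ C.T
    plan = sinkhorn_plan(1.0 - sim)
    return float((plan * sim).sum())

rng = np.random.default_rng(0)
query = rng.normal(size=(3, 8))
good = query + 0.1 * rng.normal(size=(3, 8))     # near-duplicate tokens
bad = rng.normal(size=(5, 8))                    # unrelated tokens
print(ot_relevance(query, good), ">", ot_relevance(query, bad))
```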

Authors:Kyochul Jang, Donghyeon Lee, Kyusik Kim, Dongseok Heo, Taewhoo Lee, Woojeong Kim, Bongwon Suh
Title: DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues
Abstract:
Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: https://snuhcc.github.io/DICE-Bench/.
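
The abstract defines DICE-SCORE as measuring how dispersed the tool-relevant information (function names, parameter values) is throughout the dialogue; its exact formula is not given there, so the sketch below uses one natural choice, the spread of the turns in which required slots first appear, normalized by dialogue length (purely an assumption for illustration):

```python
def dice_score_sketch(dialogue, slots):
    """Toy dispersion measure: where in the dialogue do the tool-call
    slots (function name, parameter values) first show up?"""
    positions = []
    for slot in slots:
        for turn_idx, turn in enumerate(dialogue):
            if slot.lower() in turn.lower():
                positions.append(turn_idx)
                break
    if len(positions) < 2:
        return 0.0  # everything (or nothing) sits in one place
    return (max(positions) - min(positions)) / max(len(dialogue) - 1, 1)

dialogue = [
    "A: Let's plan the trip, maybe book_flight could help.",
    "B: I prefer a window seat.",
    "A: Departure should be 2024-07-01.",
    "B: Destination is Osaka, from Seoul.",
]
print(dice_score_sketch(dialogue, ["book_flight", "2024-07-01", "Osaka", "Seoul"]))
```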

Authors:Jianhui Wei, Zijie Meng, Zikai Xiao, Tianxiang Hu, Yang Feng, Zhijie Zhou, Jian Wu, Zuozhu Liu
Title: MedEthicsQA: A Comprehensive Question Answering Benchmark for Medical Ethics Evaluation of LLMs
Abstract:
While Medical Large Language Models (MedLLMs) have demonstrated remarkable potential in clinical tasks, their ethical safety remains insufficiently explored. This paper introduces MedEthicsQA, a comprehensive benchmark comprising 5,623 multiple-choice questions and 5,351 open-ended questions for the evaluation of medical ethics in LLMs. We systematically establish a hierarchical taxonomy integrating global medical ethical standards. The benchmark encompasses widely used medical datasets, authoritative question banks, and scenarios derived from PubMed literature. Rigorous quality control involving multi-stage filtering and multi-faceted expert validation ensures the reliability of the dataset with a low error rate (2.72%). Evaluation of state-of-the-art MedLLMs exhibits declined performance in answering medical ethics questions compared to their foundation counterparts, elucidating the deficiencies of medical ethics alignment. The dataset, registered under the CC BY-NC 4.0 license, is available at https://github.com/JianhuiWei7/MedEthicsQA.
English Summary: The paper introduces MedEthicsQA, a comprehensive benchmark for evaluating medical ethics in large language models, revealing that current MedLLMs perform worse on ethical questions compared to their base versions despite rigorous dataset validation.

Authors:Oguzhan Baser, Ahmet Ege Tanriverdi, Sriram Vishwanath, Sandeep P. Chinchali
Title: PhonemeFake: Redefining Deepfake Realism with Language-Driven Segmental Manipulation and Adaptive Bilevel Detection
Abstract:
Deepfake (DF) attacks pose a growing threat as generative models become increasingly advanced. However, our study reveals that existing DF datasets fail to deceive human perception, unlike real DF attacks that influence public discourse. It highlights the need for more realistic DF attack vectors. We introduce PhonemeFake (PF), a DF attack that manipulates critical speech segments using language reasoning, significantly reducing human perception by up to 42% and benchmark accuracies by up to 94%. We release an easy-to-use PF dataset on HuggingFace and an open-source bilevel DF segment detection model that adaptively prioritizes compute on manipulated regions. Our extensive experiments across three known DF datasets reveal that our detection model reduces EER by 91% while achieving up to 90% speed-up, with minimal compute overhead and more precise localization than existing models, offering a scalable solution.
English: Deepfake attacks are advancing, but current datasets lack realism, prompting the introduction of PhonemeFake, which manipulates speech segments to significantly reduce human perception and benchmark accuracy, alongside an efficient detection model that improves performance and speed.

Authors:Brian Mak, Jeffrey Flanigan
Title: Residual Matrix Transformers: Scaling the Size of the Residual Stream
Abstract:
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972, Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPS, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at https://github.com/bmac3/residual-matrix-transformer.
English Summary: The Residual Matrix Transformer (RMT) replaces the standard residual stream with an outer product memory matrix, achieving superior efficiency and performance with fewer resources while enabling independent scaling of the residual stream.
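
The replacement for the residual vector is a Kohonen/Anderson-style outer-product associative memory: layers write features with rank-1 key-value updates and read by key. A numpy sketch of those two primitives (the dimensions and the read/write interface are assumptions; the paper integrates this into the transformer block):

```python
import numpy as np

rng = np.random.default_rng(0)
d_key, d_val = 16, 32
M = np.zeros((d_val, d_key))             # the matrix-valued "residual stream"

def write(M, key, value):
    """Store: rank-1 outer-product update, as in linear associative memory."""
    key = key / np.linalg.norm(key)
    return M + np.outer(value, key)

def read(M, key):
    """Retrieve: project the memory matrix onto the (normalized) key."""
    key = key / np.linalg.norm(key)
    return M @ key

k1, k2 = rng.normal(size=(2, d_key))
v1, v2 = rng.normal(size=(2, d_val))
M = write(write(M, k1, v1), k2, v2)
# Reading with k1 recovers v1 up to interference from the k2 entry.
print(np.corrcoef(read(M, k1), v1)[0, 1])
```

Note that the memory size d_val x d_key can grow without changing the per-layer compute interface, which matches the abstract's claim that the residual stream scales independently of model size.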

Authors:Chenyang Shao, Tianxing Li, Chenhao Pu, Fengli Xu, Yong Li
Title: AgentStealth: Reinforcing Large Language Model for Anonymizing User-generated Text
Abstract:
In today's digital world, casual user-generated content often contains subtle cues that may inadvertently expose sensitive personal attributes. Such risks underscore the growing importance of effective text anonymization to safeguard individual privacy. However, existing methods either rely on rigid replacements that damage utility or cloud-based LLMs that are costly and pose privacy risks. To address these issues, we explore the use of locally deployed smaller-scale language models (SLMs) for anonymization. Yet training effective SLMs remains challenging due to limited high-quality supervision. To address the challenge, we propose AgentStealth, a self-reinforcing LLM anonymization framework. First, we introduce an adversarial anonymization workflow enhanced by In-context Contrastive Learning and Adaptive Utility-Aware Control. Second, we perform supervised adaptation of SLMs using high-quality data collected from the workflow, which includes both anonymization and attack signals. Finally, we apply online reinforcement learning where the model leverages its internal adversarial feedback to iteratively improve anonymization performance. Experiments on two datasets show that our method outperforms baselines in both anonymization effectiveness (+12.3%) and utility (+6.8%). Our lightweight design supports direct deployment on edge devices, avoiding cloud reliance and communication-based privacy risks. Our code is open-source at https://github.com/tsinghua-fib-lab/AgentStealth.
English Summary: AgentStealth is a novel framework using locally deployed small language models to anonymize text, achieving superior privacy protection and utility through adversarial learning and reinforcement techniques.

Authors:Petr Pechman, Milan Straka, Jana Straková, Jakub Náplava
Title: Refining Czech GEC: Insights from a Multi-Experiment Approach
Abstract:
We present a grammar error correction (GEC) system that achieves the state of the art for the Czech language. Our system is based on a neural network translation approach with the Transformer architecture, and its key feature is its real-time synthetic generation pipeline, which dynamically augments sentences with artificial errors by introducing both language-agnostic and Czech-specific errors. We conduct a comprehensive series of experiments, investigating the Czech GEC corpora as bases for synthetic error introduction, several error generation strategies, domain balancing, tokenization granularity, model size, and data scaling during fine-tuning. Additionally, we evaluate the performance of large language models (LLMs) on Czech GEC in both end-user and expert fine-tuning scenarios. Our best-performing model is superior in both performance and computational efficiency. The source code and links to the trained models are available at https://github.com/ufal/tsd2025-gec.
English: We introduce a state-of-the-art Czech grammar error correction system using a Transformer-based neural network, featuring a real-time synthetic error generation pipeline and comprehensive experiments that outperform existing models in both performance and efficiency.

Authors:Youkang Wang, Jian Wang, Rubing Chen, Xiao-Yong Wei
Title: Probabilistic Optimality for Inference-time Scaling
Abstract:
Inference-time scaling has emerged as a powerful technique for enhancing the reasoning performance of Large Language Models (LLMs). However, existing approaches often rely on heuristic strategies for parallel sampling, lacking a principled foundation. To address this gap, we propose a probabilistic framework that formalizes the optimality of inference-time scaling under the assumption that parallel samples are independently and identically distributed (i.i.d.), and where the Best-of-N selection strategy follows a probability distribution that can be estimated. Within this framework, we derive a theoretical lower bound on the required number of samples to achieve a target performance level, providing the first principled guidance for compute-efficient scaling. Leveraging this insight, we develop OptScale, a practical algorithm that dynamically determines the optimal number of sampled responses. OptScale employs a language model-based predictor to estimate probabilistic prior parameters, enabling the decision of the minimal number of samples needed that satisfy predefined performance thresholds and confidence levels. Extensive experiments on mathematical reasoning benchmarks (including MATH-500, GSM8K, AIME, and AMC) demonstrate that OptScale significantly reduces sampling overhead while remaining better or on par with state-of-the-art reasoning performance. Our work offers both a theoretical foundation and a practical solution for principled inference-time scaling, addressing a critical gap in the efficient deployment of LLMs for complex reasoning. The source code is publicly available at https://github.com/Albertwyk/OptScale.
English Summary: This paper introduces OptScale, a probabilistic framework that theoretically determines the minimum number of parallel samples needed for efficient inference-time scaling in LLMs, significantly reducing computational overhead while maintaining state-of-the-art reasoning performance.
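
Under the i.i.d. assumption, the flavor of the sample-count bound is easy to reproduce: if each sample is correct with probability p and Best-of-N selection recovers a correct sample whenever one exists, then P(success) = 1 - (1 - p)^N, so hitting a target confidence level requires N >= log(1 - conf) / log(1 - p). A sketch of that calculation (this simplification assumes a perfect verifier, which the paper's estimated-distribution treatment does not require):

```python
import math

def min_samples(p_correct, target_conf):
    """Smallest N with 1 - (1 - p)^N >= target confidence,
    assuming i.i.d. samples and perfect Best-of-N selection."""
    if p_correct <= 0:
        return math.inf
    if p_correct >= 1:
        return 1
    return math.ceil(math.log(1 - target_conf) / math.log(1 - p_correct))

for p in (0.2, 0.5, 0.8):
    print(f"p={p}: need N={min_samples(p, 0.95)} samples for 95% confidence")
# An OptScale-style predictor would estimate p per question, so easy
# questions receive few samples and hard ones receive more.
```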

Authors:Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt
Title: COOCO -- Common Objects Out-of-Context -- Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication
Abstract:
Natural scenes provide us with rich contexts for object recognition and reference. In particular, knowing what type of scene one is looking at generates expectations about which objects will occur, and what their spatial configuration should be. Do Vision-Language Models (VLMs) learn to rely on scene contexts in a similar way, when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCO) dataset and test to what extent VLMs rely on scene context to refer to objects under different degrees of scene-object congruency, and different perturbations. Our findings show that models leverage scene context adaptively, depending on both the semantic relatedness between object and scene and the level of noise. In particular, models rely more on context under high target-scene congruence or when objects are degraded. Attention analysis reveals that successful object categorisation involves increased focus on the target in mid-level layers, especially under moderate noise, suggesting that VLMs dynamically balance local and contextual information for reference generation. We make our dataset, code and models available at https://github.com/cs-nlp-uu/scenereg.
English Summary: This study investigates whether Vision-Language Models utilize scene context for object reference like humans do, finding they adaptively rely on context based on object-scene congruence and noise levels, with attention mechanisms dynamically balancing local and contextual information.

Authors:Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang
Title: GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling
Abstract:
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series, predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While being stable during pretraining and scalable to large model sizes, Pre-LN suffers from an exponential growth in activation variance across layers, causing the shortcut to dominate over sub-layer outputs in the residual connection and limiting the learning capacity of deeper layers. To mitigate this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple technique that can be used in combination with existing approaches. GPAS works by scaling down the intermediate activations while keeping their gradients unchanged. This leaves information in the activations intact, and avoids the gradient vanishing problem associated with gradient downscaling. Extensive experiments across various model sizes from 71M to 1B show that GPAS achieves consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also shows promise in improving alternative architectures such as Sandwich-LN and DeepNorm, demonstrating its versatility and potential for improving training dynamics in a wide range of settings. Our code is available at https://github.com/dandingsky/GPAS.
English Summary: The paper introduces Gradient-Preserving Activation Scaling (GPAS), a technique that addresses the exponential activation variance growth in Pre-LayerNorm Transformers by scaling down intermediate activations while preserving gradients, achieving consistent performance improvements across various model sizes.
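
The scale-forward, identity-backward behavior is exactly what a stop-gradient trick provides: compute the scaled activation in the forward pass but route the original gradient around the scaling. A PyTorch sketch (any learnable per-layer gate the paper may use is reduced to a fixed alpha here):

```python
import torch

def gpas(x, alpha=0.5):
    """Forward: alpha * x (shrinks activation variance).
    Backward: gradient flows as if the function were the identity,
    because the scaled-minus-original term is detached."""
    return x + (alpha * x - x).detach()

x = torch.ones(3, requires_grad=True)
y = gpas(x, alpha=0.5)
y.sum().backward()
print(y)        # tensor([0.5, 0.5, 0.5], ...)  -> scaled forward value
print(x.grad)   # tensor([1., 1., 1.])          -> gradient unchanged
```

This is why the method avoids the gradient-vanishing issue that plain downscaling would introduce: only the activation magnitude is reduced, never the backward signal.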

Authors:Junho Myung, Yeon Su Park, Sunwoo Kim, Shin Yoo, Alice Oh
Title: PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory
Abstract:
Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs' decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at https://github.com/yeonsuuuu28/papers-please.
English Summary: The PapersPlease benchmark uses 3,700 moral dilemmas based on ERG theory to reveal how large language models exhibit systematic biases when role-playing as immigration inspectors, showing preferential treatment patterns and heightened denial rates for marginalized identities.

Authors:Hang Shao, Heting Gao, Yunhang Shen, Jiawei Chen, Lijiang Li, Zuwei Long, Bo Tong, Ke Li, Xing Sun
Title: DeepTalk: Towards Seamless and Smart Speech Interaction with Adaptive Modality-Specific MoE
Abstract:
Native multimodal large language models (MLLMs) restructure a single large language model (LLM) into a spoken language model (SLM) capable of both speech and text generation. Compared to modular and aligned MLLMs, native MLLMs preserve richer paralinguistic features such as emotion and prosody, and generate speech responses directly within the backbone LLM rather than using a separate speech decoder. This integration also results in lower response latency and smoother interaction. However, native MLLMs suffer from catastrophic forgetting and performance degradation because the paired speech-text data available for pretraining is scarce compared to the vast amount of text data required to pretrain text LLMs. To address this issue, we propose DeepTalk, a framework for adaptive modality expert learning based on a Mixture of Experts (MoE) architecture. DeepTalk first adaptively distinguishes modality experts according to their modality load within the LLM. Each modality expert then undergoes specialized single-modality training, followed by joint multimodal collaborative training. As a result, DeepTalk incurs only a 5.5% performance drop compared to the original LLM, which is significantly lower than the average drop of over 20% typically seen in native MLLMs (such as GLM-4-Voice), and is on par with modular MLLMs. Meanwhile, the end-to-end dialogue latency remains within 0.5 seconds, ensuring a seamless and intelligent speech interaction experience. Code and models are released at https://github.com/talkking/DeepTalk.
中文: 原生多模态大语言模型将语音与文本生成整合于单一模型内,保留了副语言特征并降低了延迟,但受限于配对数据不足导致性能下降;DeepTalk通过自适应模态专家学习框架,显著减少了性能损失并保持了流畅的交互体验。
English: Native multimodal large language models integrate speech and text generation directly within a single model, preserving paralinguistic features and reducing latency, but face performance issues due to limited paired data, which DeepTalk addresses through adaptive modality expert learning to minimize performance degradation and maintain efficient interaction.
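As a rough illustration of modality-specific experts, the toy module below routes text tokens and speech tokens to separate feed-forward experts given a precomputed modality mask; the paper's adaptive, load-based expert assignment is considerably more involved.

import torch
import torch.nn as nn

class ModalityMoE(nn.Module):
    """Toy modality-routed FFN: text tokens use the text expert, speech
    tokens use the speech expert. A sketch of the idea, not DeepTalk itself."""
    def __init__(self, d: int):
        super().__init__()
        make_ffn = lambda: nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.text_expert, self.speech_expert = make_ffn(), make_ffn()

    def forward(self, h: torch.Tensor, is_speech: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(h)
        out[~is_speech] = self.text_expert(h[~is_speech])    # route text tokens
        out[is_speech] = self.speech_expert(h[is_speech])    # route speech tokens
        return out

h = torch.randn(2, 6, 32)              # (batch, seq, hidden)
mask = torch.rand(2, 6) > 0.5          # True where a token came from speech
y = ModalityMoE(32)(h, mask)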

Authors:Alexandru Dumitru, V Venktesh, Adam Jatowt, Avishek Anand
Title: Evaluating List Construction and Temporal Understanding capabilities of Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated immense advances in a wide range of natural language tasks. However, these models are susceptible to hallucinations and errors, particularly on temporal understanding tasks involving multiple entities in answers. In such tasks, they fail to associate entities with accurate time intervals, to generate a complete list of entities in answers, or to reason about events associated with specific temporal bounds. Existing works do not extensively evaluate the abilities of the model to perform implicit and explicit temporal understanding in a list answer construction setup. To bridge this gap, we propose the Time referenced List based Question Answering (TLQA) benchmark, which requires structured answers in list format aligned with corresponding time periods. Our TLQA benchmark requires both list construction and temporal understanding simultaneously, which to the best of our knowledge has not been explored in prior benchmarks. We investigate the temporal understanding and list construction capabilities of state-of-the-art generative models on TLQA in closed-book and open-domain settings. Our findings reveal significant shortcomings in current models, particularly their inability to provide complete answers and temporally align facts in a closed-book setup, and the need to improve retrieval in the open-domain setup, providing clear future directions for research on TLQA. The benchmark and code are available at https://github.com/elixir-research-group/TLQA.
中文: 大语言模型在时间理解和列表构建方面存在明显不足,为此提出的TLQA基准测试揭示了当前模型在闭卷和开放域设置中的重大缺陷,为未来研究指明了方向。
English: Large Language Models struggle with temporal understanding and list construction in question answering, leading to the creation of the TLQA benchmark which reveals significant model shortcomings in both closed-book and open-domain settings.
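For intuition about the task format, a TLQA-style item presumably pairs every listed entity with its time interval; the field names below are illustrative, not the benchmark's actual schema.

# Hypothetical TLQA-style item: field names are ours, not the benchmark's.
example = {
    "question": "Who coached FC Barcelona between 2008 and 2012?",
    "answer": [
        {"entity": "Pep Guardiola", "interval": "2008-2012"},
    ],
}
# Scoring must check both list completeness (no entity missing) and
# temporal alignment (each interval matching the gold time bounds).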

Authors:Eivind Morris Bakke, Nora Winger Heggelund
Title: (Fact) Check Your Bias
Abstract:
Automatic fact verification systems increasingly rely on large language models (LLMs). We investigate how parametric knowledge biases in these models affect fact-checking outcomes of the HerO system (baseline for FEVER-25). We examine how the system is affected by: (1) potential bias in Llama 3.1's parametric knowledge and (2) intentionally injected bias. When prompted directly to perform fact-verification, Llama 3.1 labels nearly half the claims as "Not Enough Evidence". Using only its parametric knowledge it is able to reach a verdict on the remaining half of the claims. In the second experiment, we prompt the model to generate supporting, refuting, or neutral fact-checking documents. These prompts significantly influence retrieval outcomes, with approximately 50% of retrieved evidence being unique to each perspective. Notably, the model sometimes refuses to generate supporting documents for claims it believes to be false, creating an inherent negative bias. Despite differences in retrieved evidence, final verdict predictions show stability across prompting strategies. The code is available at: https://github.com/eibakke/FEVER-8-Shared-Task
中文: 本研究探讨了Llama 3.1模型的参数知识偏差对HerO事实核查系统的影响,发现直接提示会导致大量“证据不足”判定,而人为注入的偏差虽显著改变证据检索结果,却未对最终判定产生实质性影响。
English: This study examines how parametric knowledge biases in Llama 3.1 affect the HerO fact-verification system, revealing that direct prompting leads to high rates of "Not Enough Evidence" classifications while injected bias significantly alters evidence retrieval without substantially changing final verdicts.

Authors:Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, Jingchen Shu
Title: IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Abstract:
Existing autoregressive large-scale text-to-speech (TTS) models have advantages in speech naturalness, but their token-by-token generation mechanism makes it difficult to precisely control the duration of synthesized speech. This becomes a significant limitation in applications requiring strict audio-visual synchronization, such as video dubbing. This paper introduces IndexTTS2, which proposes a novel, general, and autoregressive model-friendly method for speech duration control. The method supports two generation modes: one explicitly specifies the number of generated tokens to precisely control speech duration; the other freely generates speech in an autoregressive manner without specifying the number of tokens, while faithfully reproducing the prosodic features of the input prompt. Furthermore, IndexTTS2 achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion. In the zero-shot setting, the model can accurately reconstruct the target timbre (from the timbre prompt) while perfectly reproducing the specified emotional tone (from the style prompt). To enhance speech clarity in highly emotional expressions, we incorporate GPT latent representations and design a novel three-stage training paradigm to improve the stability of the generated speech. Additionally, to lower the barrier for emotional control, we design a soft instruction mechanism based on text descriptions by fine-tuning Qwen3, effectively guiding the generation of speech with the desired emotional orientation. Finally, experimental results on multiple datasets show that IndexTTS2 outperforms state-of-the-art zero-shot TTS models in terms of word error rate, speaker similarity, and emotional fidelity. Audio samples are available at: https://index-tts.github.io/index-tts2.github.io/

Authors:Junhao Liu, Zhenhao Xu, Yuxin Fang, Yichuan Chen, Zuobin Ying, Wenhan Chang
Title: From Thinking to Output: Chain-of-Thought and Text Generation Characteristics in Reasoning Language Models
Abstract:
Recently, there have been notable advancements in large language models (LLMs), demonstrating their growing abilities in complex reasoning. However, existing research largely overlooks a thorough and systematic comparison of these models' reasoning processes and outputs, particularly regarding their self-reflection pattern (also termed "Aha moment") and the interconnections across diverse domains. This paper proposes a novel framework for analyzing the reasoning characteristics of four cutting-edge large reasoning models (GPT-o1, DeepSeek-R1, Kimi-k1.5, and Grok-3) using keyword statistics and an LLM-as-a-judge paradigm. Our approach connects their internal thinking processes with their final outputs. The evaluation draws on a diverse dataset of real-world scenario-based questions covering logical deduction, causal inference, and multi-step problem-solving. Additionally, a set of metrics is put forward to assess both the coherence of reasoning and the accuracy of the outputs. The research results uncover various patterns of how these models balance exploration and exploitation, deal with problems, and reach conclusions during the reasoning process. Through quantitative and qualitative comparisons, disparities among these models are identified in aspects such as the depth of reasoning, the reliance on intermediate steps, and the degree of similarity between their thinking processes and output patterns and those of GPT-o1. This work offers valuable insights into the trade-off between computational efficiency and reasoning robustness and provides practical recommendations for enhancing model design and evaluation in practical applications. We publicly release our project at: https://github.com/ChangWenhan/FromThinking2Output
中文: 尽管大语言模型在复杂推理方面取得显著进展,但现有研究缺乏对其推理过程与输出的系统比较;本文通过关键词统计和LLM评估框架分析四大前沿模型,揭示了它们在推理深度与效率上的差异,并为模型优化提供了实用建议。
English: Recent advances in large language models show enhanced reasoning capabilities, yet a systematic comparison of their reasoning processes and outputs is lacking, prompting this study to analyze four top models using keyword statistics and LLM evaluation, revealing differences in reasoning depth and efficiency while offering insights for model improvement.
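The keyword-statistics side of such an analysis can be as simple as counting self-reflection markers in each chain of thought; the marker list below is our guess, not the paper's.

from collections import Counter
import re

# Illustrative self-reflection markers; the paper's keyword list may differ.
MARKERS = ["wait", "hmm", "did i miss", "let me double-check", "on second thought"]

def reflection_stats(chain_of_thought: str) -> Counter:
    text = chain_of_thought.lower()
    return Counter({m: len(re.findall(re.escape(m), text)) for m in MARKERS})

print(reflection_stats("Wait, did I miss anything? Hmm, let me double-check."))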

Authors:Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong
Title: MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents
Abstract:
Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities remains challenging. Previous evaluations are commonly limited in the diversity of memory levels and interactive scenarios they cover, and they lack comprehensive metrics to reflect memory capabilities from multiple aspects. To address these problems, in this paper, we construct a more comprehensive dataset and benchmark to evaluate the memory capability of LLM-based agents. Our dataset incorporates factual memory and reflective memory as different levels, and includes participation and observation as distinct interactive scenarios. Based on our dataset, we present a benchmark, named MemBench, to evaluate the memory capability of LLM-based agents from multiple aspects, including their effectiveness, efficiency, and capacity. To benefit the research community, we release our dataset and project at https://github.com/import-myself/Membench.
中文: 本文提出了MemBench,这是一个全面的数据集和基准,旨在从多个方面评估基于LLM的智能体在不同记忆层次和交互场景下的记忆能力,解决了以往评估在多样性和指标上的不足。
English: This paper introduces MemBench, a comprehensive dataset and benchmark designed to evaluate the memory capabilities of LLM-based agents across different memory levels and interactive scenarios, addressing previous limitations in diversity and metrics.

Authors:Josefa Lia Stoisser, Marc Boubnovski Martell, Lawrence Phillips, Casper Hansen, Julien Fauqueur
Title: STRuCT-LLM: Unifying Tabular and Graph Reasoning with Reinforcement Learning for Semantic Parsing
Abstract:
We propose STRuCT-LLM, a unified framework for training large language models (LLMs) to perform structured reasoning over both relational and graph-structured data. Our approach jointly optimizes Text-to-SQL and Text-to-Cypher tasks using reinforcement learning (RL) combined with Chain-of-Thought (CoT) supervision. To support fine-grained optimization in graph-based parsing, we introduce a topology-aware reward function based on graph edit distance. Unlike prior work that treats relational and graph formalisms in isolation, STRuCT-LLM leverages shared abstractions between SQL and Cypher to induce cross-formalism transfer, enabling SQL training to improve Cypher performance and vice versa - even without shared schemas. Our largest model (QwQ-32B) achieves substantial relative improvements across tasks: on semantic parsing, Spider improves by 13.5% and Text2Cypher by 73.1%. The model also demonstrates strong zero-shot generalization, improving performance on downstream tabular QA (TableBench: 8.5%) and knowledge graph QA (CR-LT-KGQA: 1.7%) without any QA-specific supervision. These results demonstrate both the effectiveness of executable queries as scaffolds for structured reasoning and the synergistic benefits of jointly training on SQL and Cypher (code available at https://github.com/bouv/STRuCT-LLM).
中文:STRuCT-LLM是一个统一框架,通过强化学习和思维链监督联合优化Text-to-SQL与Text-to-Cypher任务,训练大语言模型对关系型和图结构数据进行结构化推理,实现了显著性能提升和强大的零样本泛化能力。
English: STRuCT-LLM is a unified framework that trains large language models for structured reasoning across relational and graph data by jointly optimizing Text-to-SQL and Text-to-Cypher tasks with reinforcement learning and Chain-of-Thought supervision, achieving significant performance improvements and strong zero-shot generalization.
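One plausible shape for the topology-aware reward: convert the graph edit distance between the predicted and gold query graphs into a bounded score. The exponential mapping below is our choice; the paper's exact normalization may differ.

import math
import networkx as nx

def topology_reward(pred: nx.Graph, gold: nx.Graph) -> float:
    # Identical graphs score 1.0; the reward decays with every node or
    # edge edit needed to turn the predicted graph into the gold one.
    ged = nx.graph_edit_distance(pred, gold)
    return math.exp(-ged)

gold = nx.Graph([("Person", "WORKS_AT"), ("WORKS_AT", "Company")])
pred = nx.Graph([("Person", "WORKS_AT")])
print(topology_reward(pred, gold))  # < 1.0: one node and one edge missing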

Authors:Jianshuo Dong, Yujia Fu, Chuanrui Hu, Chao Zhang, Han Qiu
Title: Towards Understanding the Cognitive Habits of Large Reasoning Models
Abstract:
Large Reasoning Models (LRMs), which autonomously produce a reasoning Chain of Thought (CoT) before producing final responses, offer a promising approach to interpreting and monitoring model behaviors. Inspired by the observation that certain CoT patterns -- e.g., "Wait, did I miss anything?" -- consistently emerge across tasks, we explore whether LRMs exhibit human-like cognitive habits. Building on Habits of Mind, a well-established framework of cognitive habits associated with successful human problem-solving, we introduce CogTest, a principled benchmark designed to evaluate LRMs' cognitive habits. CogTest includes 16 cognitive habits, each instantiated with 25 diverse tasks, and employs an evidence-first extraction method to ensure reliable habit identification. With CogTest, we conduct a comprehensive evaluation of 16 widely used LLMs (13 LRMs and 3 non-reasoning ones). Our findings reveal that LRMs, unlike conventional LLMs, not only exhibit human-like habits but also adaptively deploy them according to different tasks. Finer-grained analyses further uncover patterns of similarity and difference in LRMs' cognitive habit profiles, particularly certain inter-family similarity (e.g., Qwen-3 models and DeepSeek-R1). Extending the study to safety-related tasks, we observe that certain habits, such as Taking Responsible Risks, are strongly associated with the generation of harmful responses. These findings suggest that studying persistent behavioral patterns in LRMs' CoTs is a valuable step toward deeper understanding of LLM misbehavior. The code is available at: https://github.com/jianshuod/CogTest.
中文摘要:大型推理模型(LRMs)通过CogTest基准测试展现出类人的认知习惯,这些习惯能根据不同任务自适应调整,且某些习惯(如“承担风险”)与生成有害内容密切相关。
English Summary: Large Reasoning Models (LRMs) demonstrate human-like cognitive habits that adapt to different tasks, as revealed by the CogTest benchmark, which also links certain habits to safety risks in model responses.

Authors:Baqer M. Merzah, Tania Taami, Salman Asoudeh, Saeed Mirzaee, Amir reza Hossein pour, Amir Ali Bengari
Title: BioPars: A Pretrained Biomedical Large Language Model for Persian Biomedical Text Mining
Abstract:
Large Language Models (LLMs) have recently gained attention in the life sciences due to their capacity to model, extract, and apply complex biological information. Beyond their classical use as chatbots, these systems are increasingly used for complex analysis and problem-solving in specialized fields, including bioinformatics. First, we introduce BIOPARS-BENCH, a dataset from over 10,000 scientific articles, textbooks, and medical websites. BioParsQA, consisting of 5,231 Persian medical questions and answers, was also introduced to evaluate the proposed model. This study then introduces BioPars, a simple but accurate measure designed to assess LLMs for three main abilities: acquiring subject-specific knowledge, interpreting and synthesizing such knowledge, and demonstrating proper evidence. Comparing ChatGPT, Llama, and Galactica, our study highlights their ability to remember and retrieve learned knowledge but also reveals shortcomings in addressing higher-level, real-world questions and fine-grained inferences. These findings indicate the need for further fine-tuning to address the capabilities of LLMs in bioinformatics tasks. To our knowledge, BioPars is the first application of LLMs in Persian medical QA, especially for generating long answers. Evaluation on four selected medical QA datasets shows that BioPars has achieved remarkable results compared to comparative approaches. On BioParsQA, the model achieved a ROUGE-L score of 29.99, an improvement of 1.0 over GPT-4. It also achieved a BERTScore of 90.87 with the MMR method, and its MoverScore (60.43) and BLEURT (50.78) were higher than those of the other three models. BioPars is an ongoing project and all resources related to its development will be made available via the following GitHub repository: https://github.com/amirap80/BioPars.
中文摘要:本研究提出BioPars框架用于评估大型语言模型在波斯语医学问答中的表现,结果显示其优于现有模型,同时揭示了模型在处理复杂生物医学推理方面的局限性。
English Summary: This study introduces BioPars, a framework for evaluating large language models in Persian medical question-answering, demonstrating superior performance over existing models while highlighting current limitations in handling complex biomedical reasoning.

Authors:Jiyan Liu, Youzheng Liu, Taihang Wang, Xiaoman Xu, Yimin Wang, Ye Jiang
Title: Team QUST at SemEval-2025 Task 10: Evaluating Large Language Models in Multiclass Multi-label Classification of News Entity Framing
Abstract:
This paper describes the participation of QUST_NLP in the SemEval-2025 Task 7. We propose a three-stage retrieval framework specifically designed for fact-checked claim retrieval. Initially, we evaluate the performance of several retrieval models and select the one that yields the best results for candidate retrieval. Next, we employ multiple re-ranking models to enhance the candidate results, with each model selecting the Top-10 outcomes. In the final stage, we utilize weighted voting to determine the final retrieval outcomes. Our approach achieved 5th place in the monolingual track and 7th place in the crosslingual track. We release our system code at: https://github.com/warmth27/SemEval2025_Task7.
中文: 本文介绍了QUST_NLP团队为事实核查声明检索设计的三阶段检索框架,在SemEval-2025任务7的单语和跨语种赛道中分别获得第五和第七名。
English: This paper presents QUST_NLP's three-stage retrieval framework for fact-checked claim retrieval, which achieved 5th and 7th places in SemEval-2025 Task 7's monolingual and crosslingual tracks respectively.
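The final voting stage might look like the sketch below, where each re-ranker's Top-10 list contributes rank-discounted, model-weighted points; the (k - rank) scoring rule is our assumption, since the paper only says "weighted voting".

from collections import defaultdict

def weighted_vote(rankings: dict[str, list[str]],
                  weights: dict[str, float], k: int = 10) -> list[str]:
    # Each model adds weight * (k - rank) points to every candidate in its
    # Top-k list; candidates are returned by descending total score.
    scores: dict[str, float] = defaultdict(float)
    for model, topk in rankings.items():
        for rank, cand in enumerate(topk[:k]):
            scores[cand] += weights[model] * (k - rank)
    return sorted(scores, key=scores.get, reverse=True)

rankings = {"reranker_a": ["c3", "c1", "c7"], "reranker_b": ["c1", "c3", "c9"]}
print(weighted_vote(rankings, {"reranker_a": 1.0, "reranker_b": 0.8})[:3])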

Authors:Xinzhuo Li, Adheesh Juvekar, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Ismini Lourentzou
Title: HalluSegBench: Counterfactual Visual Reasoning for Segmentation Hallucination Evaluation
Abstract:
Recent progress in vision-language segmentation has significantly advanced grounded visual understanding. However, these models often exhibit hallucinations by producing segmentation masks for objects not grounded in the image content or by incorrectly labeling irrelevant regions. Existing evaluation protocols for segmentation hallucination primarily focus on label or textual hallucinations without manipulating the visual context, limiting their capacity to diagnose critical failures. In response, we introduce HalluSegBench, the first benchmark specifically designed to evaluate hallucinations in visual grounding through the lens of counterfactual visual reasoning. Our benchmark consists of a novel dataset of 1340 counterfactual instance pairs spanning 281 unique object classes, and a set of newly introduced metrics that quantify hallucination sensitivity under visually coherent scene edits. Experiments on HalluSegBench with state-of-the-art vision-language segmentation models reveal that vision-driven hallucinations are significantly more prevalent than label-driven ones, with models often persisting in false segmentation, highlighting the need for counterfactual reasoning to diagnose grounding fidelity.

Authors:Akshay Paruchuri, Maryam Aziz, Rohit Vartak, Ayman Ali, Best Uchehara, Xin Liu, Ishan Chatterjee, Monica Agrawal
Title: "What's Up, Doc?": Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets
Abstract:
People are increasingly seeking healthcare information from large language models (LLMs) via interactive chatbots, yet the nature and inherent risks of these conversations remain largely unexplored. In this paper, we filter large-scale conversational AI datasets to achieve HealthChat-11K, a curated dataset of 11K real-world conversations composed of 25K user messages. We use HealthChat-11K and a clinician-driven taxonomy for how users interact with LLMs when seeking healthcare information in order to systematically study user interactions across 21 distinct health specialties. Our analysis reveals insights into the nature of how and why users seek health information, such as common interactions, instances of incomplete context, affective behaviors, and interactions (e.g., leading questions) that can induce sycophancy, underscoring the need for improvements in the healthcare support capabilities of LLMs deployed as conversational AI. Code and artifacts to retrieve our analyses and combine them into a curated dataset can be found here: https://github.com/yahskapar/HealthChat
中文摘要:本研究通过分析1.1万条真实医疗对话数据,揭示了用户与AI交互时存在信息不完整、诱导性提问等风险,强调需提升医疗对话AI的辅助能力。
English Summary: This study analyzes 11,000 real-world health conversations with AI chatbots, revealing user interaction patterns and risks like incomplete context and sycophancy that highlight the need for improved healthcare AI capabilities.

Authors:Marek Šuppa, Andrej Ridzik, Daniel Hládek, Tomáš Javůrek, Viktória Ondrejová, Kristína Sásiková, Martin Tamajka, Marián Šimko
Title: skLEP: A Slovak General Language Understanding Benchmark
Abstract:
In this work, we introduce skLEP, the first comprehensive benchmark specifically designed for evaluating Slovak natural language understanding (NLU) models. We have compiled skLEP to encompass nine diverse tasks that span token-level, sentence-pair, and document-level challenges, thereby offering a thorough assessment of model capabilities. To create this benchmark, we curated new, original datasets tailored for Slovak and meticulously translated established English NLU resources. Within this paper, we also present the first systematic and extensive evaluation of a wide array of Slovak-specific, multilingual, and English pre-trained language models using the skLEP tasks. Finally, we also release the complete benchmark data, an open-source toolkit facilitating both fine-tuning and evaluation of models, and a public leaderboard at https://github.com/slovak-nlp/sklep in the hopes of fostering reproducibility and driving future research in Slovak NLU.
Chinese: 本文介绍了skLEP,这是首个专为评估斯洛伐克自然语言理解模型设计的综合基准,包含九项多样化任务和原始数据集,通过对多种模型进行全面评估并公开所有资源,旨在推动该领域的未来研究。
English: This paper introduces skLEP, the first comprehensive benchmark for evaluating Slovak natural language understanding models, featuring nine diverse tasks and original datasets, along with an extensive evaluation of various models and the release of all resources to promote future research.

Authors:Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Title: Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Abstract:
Agentic search, such as Deep Research systems where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers, represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.
中文:智能搜索系统如深度研究能自主浏览并整合网络信息,但其复杂性超越了现有评估方法,为此开发了Mind2Web 2基准和“代理即裁判”框架进行自动评估,顶尖系统仅用一半时间即可达到人类50-70%的效能。
English: Agentic search systems like Deep Research autonomously browse and synthesize web information, but their complexity exceeds current evaluation methods, leading to the creation of Mind2Web 2 benchmark and an Agent-as-a-Judge framework for automated assessment, showing top systems achieve 50-70% of human efficiency in half the time.
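A tree-structured rubric reduces naturally to weighted recursive scoring. The sketch below is a minimal rendering of that idea with hypothetical string checks; the actual judge agents run LLM-based verification at the leaves rather than simple predicates.

from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class RubricNode:
    """A leaf holds a binary check on the answer; an internal node
    aggregates its children's weighted scores."""
    name: str
    weight: float = 1.0
    check: Optional[Callable[[str], bool]] = None
    children: list["RubricNode"] = field(default_factory=list)

    def score(self, answer: str) -> float:
        if self.check is not None:
            return float(self.check(answer))
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.score(answer) for c in self.children) / total

rubric = RubricNode("task", children=[
    RubricNode("cites_source", check=lambda a: "http" in a),
    RubricNode("states_price", weight=2.0, check=lambda a: "$" in a),
])
print(rubric.score("The ticket costs $42, per https://example.com"))  # 1.0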

Authors:Istabrak Abbes, Gabriele Prato, Quentin Fournier, Fernando Rodriguez, Alaa Boukhary, Adam Elwood, Sarath Chandar
Title: Small Encoders Can Rival Large Decoders in Detecting Groundedness
Abstract:
Augmenting large language models (LLMs) with external context significantly improves their performance in natural language processing (NLP) tasks. However, LLMs struggle to answer queries reliably when the provided context lacks information, often resorting to ungrounded speculation or internal knowledge. Groundedness, that is, generating responses strictly supported by the context, is essential for ensuring factual consistency and trustworthiness. This study focuses on detecting whether a given query is grounded in a document provided in context before the costly answer generation by LLMs. Such a detection mechanism can significantly reduce both inference time and resource consumption. We show that lightweight, task-specific encoder models such as RoBERTa and NomicBERT, fine-tuned on curated datasets, can achieve accuracy comparable to state-of-the-art LLMs, such as Llama3 8B and GPT4o, in groundedness detection while reducing inference latency by orders of magnitude. The code is available at: https://github.com/chandarlab/Hallucinate-less
中文摘要:增强大型语言模型的外部上下文可提升其自然语言处理性能,但确保回答严格基于上下文仍具挑战;本研究提出一种轻量级编码器模型,在昂贵的LLM生成答案前检测查询是否基于文档,能以极低延迟和资源消耗达到与先进LLMs相当的检测精度。
English Summary: Augmenting LLMs with external context boosts NLP performance, but ensuring grounded responses remains challenging; this study introduces a lightweight encoder model that detects query grounding in documents before costly LLM processing, achieving comparable accuracy to advanced LLMs while drastically cutting latency and resource use.
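In code, the encoder-as-gate pattern is a plain sequence-pair classifier run before generation. The sketch below uses an off-the-shelf roberta-base head as a stand-in; the paper fine-tunes such encoders on curated groundedness data first, so this checkpoint choice is purely illustrative.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("roberta-base")   # untuned stand-in checkpoint
clf = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

def is_grounded(query: str, document: str, threshold: float = 0.5) -> bool:
    # Encode the (query, document) pair; label 1 is taken to mean "grounded".
    inputs = tok(query, document, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = clf(**inputs).logits.softmax(-1)
    return probs[0, 1].item() >= threshold

# Gate the expensive generator: only call it when the query is answerable.
# if is_grounded(query, doc): answer = big_llm(query, doc)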

Authors:Xin Xu, Tianhao Chen, Fan Zhang, Wanlong Liu, Pengxiang Li, Ajay Kumar Jaiswal, Yuchen Yan, Jishan Hu, Yang Wang, Hao Chen, Shiwei Liu, Shizhe Diao, Can Yang, Lu Yin
Title: Double-Checker: Enhancing Reasoning of Slow-Thinking LLMs via Self-Critical Fine-Tuning
Abstract:
While slow-thinking large language models (LLMs) exhibit reflection-like reasoning, commonly referred to as the "aha moment", their ability to generate informative critiques and refine prior solutions remains limited. In this paper, we introduce Double-Checker, a principled framework designed to enhance the reasoning capabilities of slow-thinking LLMs by fostering explicit self-critique and iterative refinement of their previous solutions. By fine-tuning on our curated 1,730 self-critical instances, Double-Checker empowers long-CoT LLMs to iteratively critique and refine their outputs during inference until they evaluate their solutions as correct under self-generated critiques. We validate the efficacy of Double-Checker across a comprehensive suite of reasoning benchmarks, demonstrating that iterative self-critique significantly enhances the reasoning capabilities of long-CoT LLMs. Notably, our Double-Checker increases the pass@1 performance on challenging AIME benchmarks from 4.4% to 18.2% compared to the original long-CoT LLMs. These results highlight a promising direction for developing more trustworthy and effective LLMs capable of structured self-critique. Our codes and data are available at https://github.com/XinXU-USTC/DoubleChecker
中文: 本文提出的Double-Checker框架通过让慢思考大语言模型进行迭代式自我批判和答案优化,显著提升了推理能力,在AIME等基准测试中的通过率从4.4%提升至18.2%。
English: This paper introduces Double-Checker, a framework that enhances slow-thinking LLMs' reasoning by enabling iterative self-critique and refinement of solutions, significantly improving performance on reasoning benchmarks like AIME from 4.4% to 18.2%.
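The inference-time loop implied by the abstract is simple to state: generate, self-critique, refine, stop when the critique passes. The sketch below assumes an abstract llm(prompt) -> str callable and illustrative prompt wording, not the authors' actual prompts.

def double_check(problem: str, llm, max_rounds: int = 4) -> str:
    # llm is any prompt -> text callable; prompts here are illustrative.
    solution = llm(f"Solve step by step:\n{problem}")
    for _ in range(max_rounds):
        critique = llm(f"Critique this solution to '{problem}':\n{solution}\n"
                       "Reply CORRECT if it is right, otherwise list the flaws.")
        if critique.strip().startswith("CORRECT"):
            break                       # the model judges its own answer correct
        solution = llm(f"Problem: {problem}\nFlawed solution: {solution}\n"
                       f"Critique: {critique}\nWrite a corrected solution.")
    return solution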

Authors:Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
Title: Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
Abstract:
The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB's continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results' generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb
中文: 本文阐述了提升大规模文本嵌入基准(MTEB)可复现性和可扩展性的工程实践,包括稳健的持续集成流程以及处理社区贡献与扩展数据集的方法。
English: This paper details the engineering practices that enhance the reproducibility and extensibility of the Massive Text Embedding Benchmark (MTEB), including robust continuous integration pipelines and strategies for community contributions and dataset expansion.

Authors:Tim Lawson, Laurence Aitchison
Title: Learning to Skip the Middle Layers of Transformers
Abstract:
Conditional computation is a popular strategy to make Transformers more efficient. Existing methods often target individual modules (e.g., mixture-of-experts layers) or skip layers independently of one another. However, interpretability research has demonstrated that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Guided by these insights, we propose a novel architecture that dynamically skips a variable number of layers from the middle outward. In particular, a learned gating mechanism determines whether to bypass a symmetric span of central blocks based on the input, and a gated attention mechanism prevents subsequent tokens from attending to skipped token positions. Residual norms are controlled with a 'sandwich' or 'perilayernorm' scheme and gate sparsity with an adaptive regularization loss. We had aimed to reduce compute requirements for 'simpler' tokens and potentially foster an emergent multi-level representational hierarchy but, at the scales investigated, our approach does not achieve improvements in the trade-off between validation cross-entropy and estimated FLOPs compared to dense baselines with fewer layers. We release our code at https://github.com/tim-lawson/skip-middle.
中文: 本文提出了一种新型Transformer架构,通过学习的门控机制动态跳过中间层的可变数量,但与层数更少的基线模型相比,该方法未能在效率与准确性的权衡中取得改进。
English: This paper introduces a novel Transformer architecture that dynamically skips variable numbers of middle layers using a learned gating mechanism, though it fails to improve the efficiency-accuracy trade-off compared to fewer-layer baselines.
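Schematically, the architecture amounts to a gate that can bypass a symmetric span of central blocks. The sketch below uses a hard 0/1 decision for readability; the paper's gate is differentiable and is paired with gated attention over skipped positions and a sparsity loss.

import torch
import torch.nn as nn

def forward_with_middle_skip(blocks: nn.ModuleList, h: torch.Tensor,
                             gate: nn.Linear, span: int) -> torch.Tensor:
    # Decide from the pooled input whether to bypass `span` central blocks.
    n = len(blocks)
    lo = (n - span) // 2
    hi = lo + span
    skip = bool(gate(h.mean(dim=(0, 1))) > 0.0)   # hard gate, for clarity only
    for i, block in enumerate(blocks):
        if skip and lo <= i < hi:
            continue                               # bypassed middle block
        h = block(h)
    return h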

Authors:Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu
Title: MMSearch-R1: Incentivizing LMMs to Search
Abstract:
Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
中文:MMSearch-R1是一种创新的强化学习框架,它使大型多模态模型能够通过整合图像和文本工具进行高效、按需的搜索,在减少30%以上搜索调用的同时,性能超越了现有方法。
English: MMSearch-R1 is a novel reinforcement learning framework that enables large multimodal models to perform efficient, on-demand searches using integrated image and text tools, significantly reducing search calls while outperforming existing methods.
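The reward shaping reduces to a one-liner: outcome reward minus a per-call search cost. The penalty value below is our placeholder, not the paper's constant.

def search_reward(is_correct: bool, num_search_calls: int,
                  penalty: float = 0.1) -> float:
    # Correct answers earn 1.0; each search call costs `penalty`, so the
    # policy learns to search only when its own knowledge is insufficient.
    return (1.0 if is_correct else 0.0) - penalty * num_search_calls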

Authors:Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang
Title: DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
Abstract:
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding are still under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior, revealing how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also their generation order. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates and maintain training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR bias during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.
中文: 扩散大语言模型(dLLMs)通过全局规划和迭代优化为代码生成提供了新途径,本研究提出了基于1300亿代码标记训练的7B模型DiffuCoder及耦合GRPO强化学习方法,显著提升代码生成性能并降低自回归依赖。
English: Diffusion large language models (dLLMs) offer a novel approach to code generation with global planning and iterative refinement, and this study introduces DiffuCoder, a 7B model trained on 130B code tokens, along with a coupled-GRPO reinforcement learning method that enhances performance and reduces autoregressive bias.
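The complementary mask construction can be sketched in a few lines: draw one random token mask and pair it with its complement, so that every position is masked exactly once across the coupled pair. How the paired completions then enter the GRPO objective is the part this sketch omits.

import torch

def coupled_masks(seq_len: int, mask_ratio: float = 0.5):
    # One random mask plus its complement: across the coupled pair,
    # every token position is denoised exactly once.
    mask = torch.rand(seq_len) < mask_ratio
    return mask, ~mask

m1, m2 = coupled_masks(8)
assert torch.all(m1 ^ m2)   # each position is masked in exactly one copy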

Authors:Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Title: ReCode: Updating Code API Knowledge with Reinforcement Learning
Abstract:
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs' code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs' general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
中文摘要:ReCode是一种新颖的强化学习框架,通过基于版本迁移数据训练大语言模型,显著提升其在动态API环境中的代码生成可靠性,同时对其通用编程能力影响较小。
English Summary: ReCode is a novel reinforcement learning framework that significantly enhances large language models' ability to generate reliable code in dynamic API environments by training them on version migration data while minimizing impact on their general coding capabilities.
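As a stand-in for the paper's modified string-similarity metric, difflib gives the flavor of a dense, partial-credit reward for code migration:

import difflib

def code_similarity_reward(generated: str, reference: str) -> float:
    # In [0, 1]: exact matches score 1.0, near-misses earn partial credit,
    # which gives RL a denser signal than a binary pass/fail check.
    return difflib.SequenceMatcher(None, generated, reference).ratio()

print(code_similarity_reward("np.float(x)", "float(x)"))  # partial credit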

Authors:Guinan Su, Li Shen, Lu Yin, Shiwei Liu, Yanwu Yang, Jonas Geiping
Title: GPTailor: Large Language Model Pruning Through Layer Cutting and Stitching
Abstract:
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in deployment and inference. While structured pruning of model parameters offers a promising way to reduce computational costs at deployment time, current methods primarily focus on single model pruning. In this work, we develop a novel strategy to compress models by strategically combining or merging layers from finetuned model variants, which preserves the original model's abilities by aggregating capabilities accentuated in different finetunes. We pose the optimal tailoring of these LLMs as a zero-order optimization problem, adopting a search space that supports three different operations: (1) Layer removal, (2) Layer selection from different candidate models, and (3) Layer merging. Our experiments demonstrate that this approach leads to competitive model pruning, for example, for the Llama2-13B model families, our compressed models maintain approximately 97.3% of the original performance while removing ~25% of parameters, significantly outperforming previous state-of-the-art methods. The code is available at https://github.com/Guinan-Su/auto-merge-llm.
中文: 本研究提出了一种新颖的大语言模型分层压缩策略,通过分层移除、选择和合并的组合操作,在减少约25%参数的同时保持约97.3%的原始性能,显著优于现有方法。
English: This study introduces a novel layer-based compression strategy for large language models that combines layer removal, selection, and merging to reduce model size by approximately 25% while retaining about 97.3% of original performance, significantly outperforming existing methods.
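The search space is easy to picture as a per-layer plan drawn from three operations. The sampler below sketches one candidate of a zero-order search; scoring each stitched model on a dev set (the evaluate step) is left as a placeholder.

import random

OPS = ("remove", "select", "merge")

def random_candidate(num_layers: int, donors: list[str]) -> list[tuple]:
    # For each layer position: drop it, take it from one donor finetune,
    # or merge the corresponding layers across all donors.
    plan = []
    for layer in range(num_layers):
        op = random.choice(OPS)
        donor = random.choice(donors) if op == "select" else None
        plan.append((layer, op, donor))
    return plan

# Zero-order loop: sample many plans, evaluate each stitched model on a
# dev set, and keep the best-scoring plan.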

Authors:Songsoo Kim, Seungtae Lee, See Young Lee, Joonho Kim, Keechan Kan, Dukyong Yoon
Title: A Multi-Pass Large Language Model Framework for Precise and Efficient Radiology Report Error Detection
Abstract:
Background: The positive predictive value (PPV) of large language model (LLM)-based proofreading for radiology reports is limited due to the low error prevalence. Purpose: To assess whether a three-pass LLM framework enhances PPV and reduces operational costs compared with baseline approaches. Materials and Methods: A retrospective analysis was performed on 1,000 consecutive radiology reports (250 each: radiography, ultrasonography, CT, MRI) from the MIMIC-III database. Two external datasets (CheXpert and Open-i) were validation sets. Three LLM frameworks were tested: (1) single-prompt detector; (2) extractor plus detector; and (3) extractor, detector, and false-positive verifier. Precision was measured by PPV and absolute true positive rate (aTPR). Efficiency was calculated from model inference charges and reviewer remuneration. Statistical significance was tested using cluster bootstrap, exact McNemar tests, and Holm-Bonferroni correction. Results: Framework PPV increased from 0.063 (95% CI, 0.036-0.101, Framework 1) to 0.079 (0.049-0.118, Framework 2), and significantly to 0.159 (0.090-0.252, Framework 3; P<.001 vs. baselines). aTPR remained stable (0.012-0.014; P>=.84). Operational costs per 1,000 reports dropped to USD 5.58 (Framework 3) from USD 9.72 (Framework 1) and USD 6.85 (Framework 2), reflecting reductions of 42.6% and 18.5%, respectively. Human-reviewed reports decreased from 192 to 88. External validation supported Framework 3's superior PPV (CheXpert 0.133, Open-i 0.105) and stable aTPR (0.007). Conclusion: A three-pass LLM framework significantly enhanced PPV and reduced operational costs, maintaining detection performance, providing an effective strategy for AI-assisted radiology report quality assurance.
中文:三阶段大语言模型框架显著提高了放射学报告校对中的阳性预测值并降低了运营成本,同时保持稳定的检测性能,为人工智能辅助的质量保证提供了有效策略。
English: A three-pass LLM framework significantly improves the positive predictive value and reduces operational costs for radiology report proofreading while maintaining stable detection performance, offering an effective AI-assisted quality assurance strategy.
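The three-pass structure of Framework 3 is essentially a filter chain, sketched below with three abstract callables standing in for the separate LLM prompts:

def three_pass_proofread(report: str, extract, detect, verify) -> list[str]:
    # Pass 1: pull candidate statements out of the report.
    # Pass 2: flag statements the detector suspects are errors.
    # Pass 3: drop flagged items the verifier deems false positives,
    # so fewer (and better) candidates reach the human reviewer.
    flagged = [s for s in extract(report) if detect(s)]
    return [s for s in flagged if verify(report, s)]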

Authors:Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, Zhongyu Wei
Title: ITFormer: Bridging Time Series and Natural Language for Multi-Modal QA with Large-Scale Multitask Dataset
Abstract:
Time-series data are critical in diverse applications, such as industrial monitoring, medical diagnostics, and climate research. However, effectively integrating these high-dimensional temporal signals with natural language for dynamic, interactive tasks remains a significant challenge. To address this, we introduce the Time-Series Question Answering (Time-Series QA) task and release EngineMT-QA, the first large-scale, multi-task, temporal-textual QA dataset designed to capture complex interactions between time-series signals and natural language. Building on this resource, we propose the Instruct Time Transformer (ITFormer), a novel framework that bridges time-series encoders with frozen large language models (LLMs). ITFormer effectively extracts, aligns, and fuses temporal and textual features, achieving a substantial improvement in QA accuracy over strong baselines with fewer than 1% additional trainable parameters. By combining computational efficiency with robust cross-modal modeling, our work establishes an adaptable paradigm for integrating temporal data with natural language, paving the way for new research and applications in multi-modal AI. More details about the project, including datasets and code, are available at: https://pandalin98.github.io/itformer_site/

Authors:Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin
Title: ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
Abstract:
This paper presents ScaleCap, an inference-time scalable image captioning strategy that generates comprehensive and detailed image captions. The key challenges of high-quality image captioning lie in the inherent biases of LVLMs: multimodal bias resulting in imbalanced descriptive granularity, offering detailed accounts of some elements while merely skimming over others; linguistic bias leading to hallucinated descriptions of non-existent objects. To address these issues, we propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Specifically, we propose two novel components: heuristic question answering and contrastive sentence rating. The former generates content-specific questions based on the image and answers them to progressively inject relevant information into the caption. The latter employs sentence-level offline contrastive decoding to effectively identify and eliminate hallucinations caused by linguistic biases. With increased inference cost, more heuristic questions are raised by ScaleCap to progressively capture additional visual details, generating captions that are more accurate, balanced, and informative. Extensive modality alignment experiments demonstrate the effectiveness of ScaleCap. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks. Furthermore, ScaleCap showcases superb richness and fidelity of generated captions with two additional tasks: replacing images with captions in VQA task, and reconstructing images from captions to assess semantic coverage. Code is available at https://github.com/Cooperx521/ScaleCap.
中文: ScaleCap是一种推理时可扩展的图像描述策略,通过启发式问答和对比语句评分解决多模态和语言偏见,随着推理成本增加逐步生成更准确、平衡且信息丰富的图像描述。
English: ScaleCap is an inference-time scalable image captioning strategy that addresses multimodal and linguistic biases in LVLMs through heuristic question answering and contrastive sentence rating, progressively generating more accurate and detailed captions with increased inference cost.

Authors:Yucheng Zhou, Lingran Song, Jianbing Shen
Title: MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration
Abstract:
Recent advancements in medical Large Language Models (LLMs) have showcased their powerful reasoning and diagnostic capabilities. Despite their success, current unified multimodal medical LLMs face limitations in knowledge update costs, comprehensiveness, and flexibility. To address these challenges, we introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings highlighting the benefits of role assignment and diagnostic discernment in LLMs, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director, each embodied by an LLM-based agent. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases. Extensive experimental evaluations conducted on a wide range of publicly accessible multimodal medical datasets, incorporating text, image, audio, and video modalities, demonstrate that MAM consistently surpasses the performance of modality-specific LLMs. Notably, MAM achieves significant performance improvements ranging from 18% to 365% compared to baseline models. Our code is released at https://github.com/yczhou001/MAM.
中文: 模块化多智能体框架(MAM)通过分配专业角色优化医疗诊断流程,在多种模态医学数据集上实现了18%至365%的性能提升。
English: The Modular Multi-Agent Framework (MAM) introduces specialized LLM-based roles to enhance medical diagnosis, achieving performance improvements of 18% to 365% over baseline models across diverse multimodal datasets.

Authors:Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang
Title: Scaling Speculative Decoding with Lookahead Reasoning
Abstract:
Reasoning models excel by generating long chain-of-thoughts, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire γ-token guess is correct falls exponentially as γ grows. This means allocating more compute for longer token drafts faces an algorithmic ceiling, making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step-by-step, and each step needs only to be semantically correct, not exact token matching. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target regenerate any that fail. Token-level SD still operates within each reasoning step, so the two layers of parallelism multiply. We show Lookahead Reasoning lifts the peak speedup of SD both theoretically and empirically. Across GSM8K, AIME, and other benchmarks, Lookahead Reasoning improves the speedup of SD from 1.4x to 2.1x while preserving answer quality, and its speedup scales better with additional GPU throughput. Our code is available at https://github.com/hao-ai-lab/LookaheadReasoning
Chinese: 前瞻推理通过引入步骤级并行性改进了推测解码,允许草稿模型提出多个未来推理步骤并验证其语义正确性,与令牌级并行性相乘,从而在不影响答案质量的前提下显著提升了解码速度。
English: Lookahead Reasoning enhances speculative decoding by introducing step-level parallelism, allowing a draft model to propose multiple future reasoning steps and verifying their semantic correctness, which multiplies with token-level parallelism to significantly boost decoding speed without compromising answer quality.
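The step-level loop can be sketched with three abstract callables: a cheap draft proposer, one batched target pass, and a semantic verifier. Everything below (the names, the <done> sentinel, the acceptance rule) is our rendering of the abstract, not the released API.

def lookahead_reasoning(problem: str, draft, target, verify,
                        k: int = 4, max_steps: int = 64) -> list[str]:
    steps: list[str] = []
    while len(steps) < max_steps and (not steps or steps[-1] != "<done>"):
        proposals = draft(problem, steps, k)            # k cheap draft steps
        expansions = target(problem, steps, proposals)  # one batched target pass
        for prop, exp in zip(proposals, expansions):
            if verify(prop, exp):       # semantic match, not exact tokens
                steps.append(prop)
            else:
                steps.append(exp)       # keep the target's regeneration
                break                   # later draft steps are now stale
    return steps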

Authors:Baochang Ren, Shuofei Qiao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Title: KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
Abstract:
Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.
Chinese: 针对慢思考大语言模型的严重幻觉问题,我们提出KnowRL方法,通过知识增强的强化学习引入事实性奖励机制,引导模型进行基于事实的慢思考,在保持推理能力的同时有效减少错误输出。
English: To address severe hallucinations in slow-thinking Large Language Models, we propose KnowRL, a knowledge-enhanced reinforcement learning method that integrates factuality rewards to guide fact-based reasoning and reduce incorrect outputs while preserving reasoning capabilities.

Authors:Yuqi Zhu, Yi Zhong, Jintian Zhang, Ziheng Zhang, Shuofei Qiao, Yujie Luo, Lun Du, Da Zheng, Ningyu Zhang, Huajun Chen
Title: Why Do Open-Source LLMs Struggle with Data Analysis? A Systematic Empirical Study
Abstract:
Large Language Models (LLMs) hold promise in automating data analysis tasks, yet open-source models face significant limitations in these kinds of reasoning-intensive scenarios. In this work, we investigate strategies to enhance the data analysis capabilities of open-source LLMs. By curating a seed dataset of diverse, realistic scenarios, we evaluate model behavior across three core dimensions: data understanding, code generation, and strategic planning. Our analysis reveals three key findings: (1) Strategic planning quality serves as the primary determinant of model performance; (2) Interaction design and task complexity significantly influence reasoning capabilities; (3) Data quality demonstrates a greater impact than diversity in achieving optimal performance. We leverage these insights to develop a data synthesis methodology, demonstrating significant improvements in open-source LLMs' analytical reasoning capabilities. Code is available at https://github.com/zjunlp/DataMind.
中文摘要:本研究通过揭示战略规划是性能关键驱动因素,开发了一种数据合成方法,显著提升了开源大语言模型的分析推理能力。
English Summary: This study enhances open-source LLMs' data analysis capabilities by identifying strategic planning as the key performance driver and developing a data synthesis method that significantly improves analytical reasoning.

Authors:Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, Jaewoo Kang
Title: Outlier-Safe Pre-Training for Robust 4-Bit Quantization of Large Language Models
Abstract:
Extreme activation outliers in Large Language Models (LLMs) critically degrade quantization performance, hindering efficient on-device deployment. While channel-wise operations and adaptive gradient scaling are recognized causes, practical mitigation remains challenging. We introduce Outlier-Safe Pre-Training (OSP), a practical guideline that proactively prevents outlier formation rather than relying on post-hoc mitigation. OSP combines three key innovations: (1) the Muon optimizer, eliminating privileged bases while maintaining training efficiency; (2) Single-Scale RMSNorm, preventing channel-wise amplification; and (3) a learnable embedding projection, redistributing activation magnitudes originating from embedding matrices. We validate OSP by training a 1.4B-parameter model on 1 trillion tokens, which is the first production-scale LLM trained without such outliers. Under aggressive 4-bit quantization, our OSP model achieves a 35.7 average score across 10 benchmarks (compared to 26.5 for an Adam-trained model), with only a 2% training overhead. Remarkably, OSP models exhibit near-zero excess kurtosis (0.04) compared to extreme values (1818.56) in standard models, fundamentally altering LLM quantization behavior. Our work demonstrates that outliers are not inherent to LLMs but are consequences of training strategies, paving the way for more efficient LLM deployment. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Outlier-Safe-Pre-Training.
中文: 离群值安全预训练(OSP)通过三项关键创新主动预防大语言模型中的极端激活离群值,在激进4位量化下实现卓越性能且仅增加2%训练开销,从根本上改变了模型量化行为。
English: Outlier-Safe Pre-Training (OSP) is a novel training strategy that proactively prevents extreme activation outliers in LLMs through three key innovations, enabling superior 4-bit quantization performance with minimal training overhead and fundamentally changing LLM deployment efficiency.
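Of the three ingredients, Single-Scale RMSNorm is the simplest to picture: replace the per-channel gain vector with one scalar so that no channel can be selectively amplified. The module below is a sketch from the abstract's description, not the released code.

import torch
import torch.nn as nn

class SingleScaleRMSNorm(nn.Module):
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(()))  # one scalar, not per-channel
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard RMS normalization over the hidden dimension, but with a
        # single shared gain, removing channel-wise amplification.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms

y = SingleScaleRMSNorm()(torch.randn(2, 5, 16))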

Authors:Zhenke Duan, Jiqun Pan, Jiani Tu, Xiaoyi Wang, Yanqing Wang
Title: ECCoT: A Framework for Enhancing Effective Cognition via Chain of Thought in Large Language Model
Abstract:
In the era of large-scale artificial intelligence, Large Language Models (LLMs) have made significant strides in natural language processing. However, they often lack transparency and generate unreliable outputs, raising concerns about their interpretability. To address this, the Chain of Thought (CoT) prompting method structures reasoning into step-by-step deductions. Yet, not all reasoning chains are valid, and errors can lead to unreliable conclusions. We propose ECCoT, an End-to-End Cognitive Chain of Thought Validation Framework, to evaluate and refine reasoning chains in LLMs. ECCoT integrates the Markov Random Field-Embedded Topic Model (MRF-ETM) for topic-aware CoT generation and Causal Sentence-BERT (CSBert) for causal reasoning alignment. By filtering ineffective chains using structured ordering statistics, ECCoT improves interpretability, reduces biases, and enhances the trustworthiness of LLM-based decision-making. Key contributions include the introduction of ECCoT, MRF-ETM for topic-driven CoT generation, and CSBert for causal reasoning enhancement. Code is released at: https://github.com/erwinmsmith/ECCoT.git.
中文摘要:ECCoT框架通过MRF-ETM实现主题感知的思维链生成,并利用CSBert进行因果对齐验证,有效提升大语言模型推理链的可靠性,增强可解释性并减少偏见。
English Summary: The ECCoT framework enhances the reliability of Large Language Models by validating reasoning chains using MRF-ETM for topic-aware generation and CSBert for causal alignment, thereby improving interpretability and reducing biases.

Authors:Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
Title: Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
Abstract:
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce Mem4Nav, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and >10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our code is open-sourced at https://github.com/tsinghua-fib-lab/Mem4Nav.
中文: Mem4Nav提出了一种层次化空间认知记忆系统,通过融合细粒度体素索引与语义地标连通性来增强视觉语言导航智能体,在多个基准测试中实现了显著的性能提升。
English: Mem4Nav introduces a hierarchical spatial-cognition memory system that enhances Vision-and-Language Navigation agents by combining fine-grained voxel indexing with semantic landmark connectivity, achieving significant performance improvements across multiple benchmarks.
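
The LTM/STM split is the part of Mem4Nav that translates most directly into code. Below is a deliberately simplified, non-trainable sketch: a dict keyed by quantized voxel coordinates stands in for the octree, and a bounded deque for the short-term cache; all class and method names are illustrative:

```python
from collections import deque

class DualMemory:
    """Toy long-/short-term memory in the spirit of Mem4Nav's LTM/STM split.

    Mem4Nav stores trainable tokens in an octree and topology graph; here we
    keep a plain dict keyed by quantized voxel coordinates (LTM) and a small
    deque of recent multimodal entries (STM).
    """

    def __init__(self, voxel_size: float = 1.0, stm_size: int = 8):
        self.voxel_size = voxel_size
        self.ltm = {}                      # voxel key -> list of observations
        self.stm = deque(maxlen=stm_size)  # recent entries, oldest evicted

    def _voxel_key(self, xyz):
        return tuple(int(c // self.voxel_size) for c in xyz)

    def observe(self, xyz, observation):
        self.stm.append((xyz, observation))                   # fast local cache
        self.ltm.setdefault(self._voxel_key(xyz), []).append(observation)

    def recall(self, xyz):
        """STM first (recent context), then LTM for deeper history."""
        key = self._voxel_key(xyz)
        recent = [o for p, o in self.stm if self._voxel_key(p) == key]
        return recent or self.ltm.get(key, [])

mem = DualMemory()
mem.observe((2.3, 0.0, 5.1), "red mailbox on the corner")
print(mem.recall((2.9, 0.4, 5.0)))  # same voxel -> hit
```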

Authors:Jisu Shin, Juhyun Oh, Eunsu Kim, Hoyun Song, Alice Oh
Title: Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation
Abstract:
Ensuring persona fidelity in large language models (LLMs) is essential for maintaining coherent and engaging human-AI interactions. However, LLMs often exhibit Out-of-Character (OOC) behavior, where generated responses deviate from an assigned persona, leading to inconsistencies that affect model reliability. Existing evaluation methods typically assign single scores to entire responses, struggling to capture subtle persona misalignment, particularly in long-form text generation. To address this limitation, we propose an atomic-level evaluation framework that quantifies persona fidelity at a finer granularity. Our three key metrics measure the degree of persona alignment and consistency within and across generations. Our approach enables a more precise and realistic assessment of persona fidelity by identifying subtle deviations that real users would encounter. Through our experiments, we demonstrate that our framework effectively detects persona inconsistencies that prior methods overlook. By analyzing persona fidelity across diverse tasks and personality types, we reveal how task structure and persona desirability influence model adaptability, highlighting challenges in maintaining consistent persona expression.
中文: 提出的原子级评估框架以更精细的粒度衡量大语言模型的人物忠实度,有效检测出现有方法忽略的细微不一致性,并揭示任务结构和角色期望如何影响模型的适应性。
English: The proposed atomic-level evaluation framework measures persona fidelity in LLMs with finer granularity, effectively detecting subtle inconsistencies overlooked by existing methods and revealing how task structure and persona desirability impact model adaptability.

Authors:Ramaravind K. Mothilal, Joanna Roy, Syed Ishtiaque Ahmed, Shion Guha
Title: Human-Aligned Faithfulness in Toxicity Explanations of LLMs
Abstract:
The discourse around toxicity and LLMs in NLP largely revolves around detection tasks. This work shifts the focus to evaluating LLMs' reasoning about toxicity -- from their explanations that justify a stance -- to enhance their trustworthiness in downstream tasks. Despite extensive research on explainability, it is not straightforward to adopt existing methods to evaluate free-form toxicity explanation due to their over-reliance on input text perturbations, among other challenges. To account for these, we propose a novel, theoretically-grounded multi-dimensional criterion, Human-Aligned Faithfulness (HAF), that measures the extent to which LLMs' free-form toxicity explanations align with those of a rational human under ideal conditions. We develop six metrics, based on uncertainty quantification, to comprehensively evaluate the HAF of LLMs' toxicity explanations with no human involvement, and highlight how "non-ideal" the explanations are. We conduct several experiments on three Llama models (of size up to 70B) and an 8B Ministral model on five diverse toxicity datasets. Our results show that while LLMs generate plausible explanations to simple prompts, their reasoning about toxicity breaks down when prompted about the nuanced relations between the complete set of reasons, the individual reasons, and their toxicity stances, resulting in inconsistent and nonsensical responses. We open-source our code and LLM-generated explanations at https://github.com/uofthcdslab/HAF.
Chinese Summary: 本研究提出人类对齐忠实度(HAF)这一新标准,用于评估大语言模型毒性解释与人类推理的对齐程度,发现尽管模型能生成合理的基础解释,但在处理原因与毒性立场间的微妙关系时,其推理能力会出现崩溃。
English Summary: This research introduces Human-Aligned Faithfulness (HAF), a novel criterion to evaluate how well LLMs' toxicity explanations align with human reasoning, revealing that while models produce plausible basic explanations, their reasoning collapses when addressing nuanced relationships between reasons and toxicity stances.

Authors:Sahil Kale, Vijaykant Nadadur
Title: Mirage of Mastery: Memorization Tricks LLMs into Artificially Inflated Self-Knowledge
Abstract:
When artificial intelligence mistakes memorization for intelligence, it creates a dangerous mirage of reasoning. Existing studies treat memorization and self-knowledge deficits in LLMs as separate issues and do not recognize an intertwining link that degrades the trustworthiness of LLM responses. In our study, we utilize a novel framework to ascertain whether LLMs genuinely learn reasoning patterns from training data or merely memorize them to project competence on problems of similar complexity, with a focus on STEM domains. Our analysis shows a noteworthy problem in generalization: LLMs draw confidence from memorized solutions to infer a higher self-knowledge about their reasoning ability, which manifests as an over 45% inconsistency in feasibility assessments when faced with self-validated, logically coherent task perturbations. This effect is most pronounced in the science and medicine domains, which tend to have the most standardized jargon and problems, further supporting our approach. Significant wavering in LLMs' self-knowledge also reveals flaws in current architectures and training patterns, highlighting the need for techniques that ensure a balanced, consistent stance on models' perceptions of their own knowledge for maximum AI explainability and trustworthiness. Our code and results are available publicly at https://github.com/knowledge-verse-ai/LLM-Memorization_SK_Eval-.
中文摘要:人工智能将记忆误认为推理,导致其自我认知不可靠,在面对逻辑一致的任务变化时可行性评估出现超45%的不一致性,尤其在STEM领域最为显著。
English Summary: AI's confusion between memorization and genuine reasoning leads to unreliable self-assessment, with over 45% inconsistency in handling modified tasks, especially in STEM fields.

Authors:Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Title: From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Abstract:
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.
中文摘要:具备推理能力的大语言模型正在开创“智能深度研究”新范式,通过自主推理与迭代检索的深度融合,显著超越了传统搜索方法,有望成为未来信息获取的主导模式。
English Summary: Large Language Models with reasoning capabilities are pioneering Agentic Deep Research, a new paradigm that integrates autonomous reasoning and iterative retrieval to significantly outperform traditional search methods and redefine future information seeking.

Authors:Zihan Wang, Rui Pan, Jiarui Yao, Robert Csordas, Linjie Li, Lu Yin, Jiajun Wu, Tong Zhang, Manling Li, Shiwei Liu
Title: Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models
Abstract:
We propose Chain-of-Experts (CoE), a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. Unlike traditional MoE models, where experts operate independently in parallel, CoE processes tokens iteratively across a chain of experts inside a layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each iteration step within a layer. This design allows tokens to re-evaluate and select different experts during each iteration, rather than being statically assigned. As a result, CoE introduces a flexible routing mechanism that increases the diversity of expert combinations and enriches the model's representational capacity. CoE demonstrates improved performance under fixed compute: on math reasoning tasks, it reduces validation loss from 1.20 to 1.12 compared to a standard MoE. Beyond performance, CoE offers a new scaling axis: depth through expert iteration, which complements conventional width/depth scaling. For example, using 2x iterations matches the performance of 3x expert selections (in width), while reducing memory usage by 17.6-42% relative to other scaling strategies. Our analysis reveals that CoE's benefits stem from its iterative residual structure and enhanced expert specialization empowered by iterative routing, which together unlock more expressive representations. Code is available at https://github.com/ZihanWang314/coe.
中文: Chain-of-Experts (CoE) 提出在层内实现专家顺序通信的新架构,通过动态路由和迭代处理增强模型表达能力,在数学推理任务上将验证损失从1.20降至1.12,相比传统混合专家模型内存使用减少17.6-42%。
English: Chain-of-Experts (CoE) introduces sequential expert communication within layers, enabling dynamic routing and iterative processing that enhances model capacity and reduces validation loss from 1.20 to 1.12 on math tasks while cutting memory usage by 17.6-42% compared to traditional MoE architectures.
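
The core mechanism, re-routing tokens through experts over several iterations within one layer, can be sketched compactly. The toy PyTorch layer below uses top-1 routing and illustrative sizes; the paper's exact routing and scaling choices may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChainOfExpertsLayer(nn.Module):
    """Minimal sketch of CoE-style iterative routing inside one layer.

    Tokens pass through n_iters expert steps; a dedicated router per
    iteration picks top-1 of n_experts MLP experts, and each step adds a
    residual (the abstract's "iterative residual structure").
    """

    def __init__(self, d_model=64, n_experts=4, n_iters=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))
        # One router per iteration, so tokens can re-select experts each step.
        self.routers = nn.ModuleList(nn.Linear(d_model, n_experts)
                                     for _ in range(n_iters))

    def forward(self, x):                       # x: (batch, seq, d_model)
        for router in self.routers:
            weights = F.softmax(router(x), dim=-1)          # (B, S, E)
            idx = weights.argmax(dim=-1)                    # top-1 expert id
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    out[mask] = expert(x[mask])
            gate = weights.gather(-1, idx.unsqueeze(-1))    # keeps routing differentiable
            x = x + gate * out                              # residual update
        return x

layer = ChainOfExpertsLayer()
print(layer(torch.randn(2, 5, 64)).shape)  # torch.Size([2, 5, 64])
```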

Authors:Jiaru Zou, Ling Yang, Jingwen Gu, Jiahao Qiu, Ke Shen, Jingrui He, Mengdi Wang
Title: ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs
Abstract:
Process Reward Models (PRMs) have recently emerged as a powerful framework for supervising intermediate reasoning steps in large language models (LLMs). Previous PRMs are primarily trained on model final output responses and struggle to evaluate intermediate thinking trajectories robustly, especially in the emerging setting of trajectory-response outputs generated by frontier reasoning models like Deepseek-R1. In this work, we introduce ReasonFlux-PRM, a novel trajectory-aware PRM explicitly designed to evaluate the trajectory-response type of reasoning traces. ReasonFlux-PRM incorporates both step-level and trajectory-level supervision, enabling fine-grained reward assignment aligned with structured chain-of-thought data. We adapt ReasonFlux-PRM to support reward supervision under both offline and online settings, including (i) selecting high-quality model distillation data for downstream supervised fine-tuning of smaller models, (ii) providing dense process-level rewards for policy optimization during reinforcement learning, and (iii) enabling reward-guided Best-of-N test-time scaling. Empirical results on challenging downstream benchmarks such as AIME, MATH500, and GPQA-Diamond demonstrate that ReasonFlux-PRM-7B selects higher quality data than strong PRMs (e.g., Qwen2.5-Math-PRM-72B) and human-curated baselines. Furthermore, our derived ReasonFlux-PRM-7B yields consistent performance improvements, achieving average gains of 12.1% in supervised fine-tuning, 4.5% in reinforcement learning, and 6.3% in test-time scaling. We also release our efficient ReasonFlux-PRM-1.5B for resource-constrained applications and edge deployment. Project: https://github.com/Gen-Verse/ReasonFlux
中文: ReasonFlux-PRM是一种新型轨迹感知过程奖励模型,专门用于评估推理轨迹和响应,在多个基准测试中展现出优于传统模型的数据选择能力和性能提升。
English: ReasonFlux-PRM is a novel trajectory-aware process reward model designed to evaluate intermediate reasoning steps and responses, demonstrating superior data selection and performance gains in fine-tuning, reinforcement learning, and test-time scaling across multiple benchmarks.
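
Since the abstract describes step-level and trajectory-level rewards feeding reward-guided Best-of-N selection, a toy aggregation makes the mechanics concrete. The mean-based blend and the weight alpha below are assumptions, not the paper's estimator:

```python
import numpy as np

def trajectory_score(step_rewards, traj_reward, alpha=0.5):
    """Blend step-level and trajectory-level supervision into one score.

    ReasonFlux-PRM assigns rewards at both granularities; the exact
    aggregation is the paper's, so this mean-based blend is an
    illustrative stand-in.
    """
    return alpha * float(np.mean(step_rewards)) + (1 - alpha) * traj_reward

def best_of_n(candidates, alpha=0.5):
    """Reward-guided Best-of-N: keep the trace with the highest blended score."""
    scored = [(trajectory_score(steps, traj, alpha), resp)
              for resp, steps, traj in candidates]
    return max(scored)[1]

# (response, per-step rewards from the PRM, whole-trajectory reward)
candidates = [
    ("answer A", [0.9, 0.8, 0.7], 0.6),
    ("answer B", [0.4, 0.9, 0.9], 0.9),
]
print(best_of_n(candidates))  # "answer B"
```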

Authors:Junyan Li, Yang Zhang, Muhammad Yusuf Hassan, Talha Chafekar, Tianle Cai, Zhile Ren, Pengsheng Guo, Foroozan Karimzadeh, Colorado Reed, Chong Wang, Chuang Gan
Title: CommVQ: Commutative Vector Quantization for KV Cache Compression
Abstract:
Large Language Models (LLMs) are increasingly used in applications requiring long context lengths, but the key-value (KV) cache often becomes a memory bottleneck on GPUs as context grows. To address this, we propose Commutative Vector Quantization (CommVQ) to significantly reduce memory usage for long-context LLM inference. We first introduce additive quantization with a lightweight encoder and codebook to compress the KV cache, which can be decoded via simple matrix multiplication. To further reduce computational costs during decoding, we design the codebook to be commutative with Rotary Position Embedding (RoPE) and train it using an Expectation-Maximization (EM) algorithm. This enables efficient integration of decoding into the self-attention mechanism. Our approach achieves high accuracy with additive quantization and low overhead via the RoPE-commutative codebook. Experiments on long-context benchmarks and GSM8K show that our method reduces FP16 KV cache size by 87.5% with 2-bit quantization, while outperforming state-of-the-art KV cache quantization methods. Notably, it enables 1-bit KV cache quantization with minimal accuracy loss, allowing a LLaMA-3.1 8B model to run with a 128K context length on a single RTX 4090 GPU. The source code is available at: https://github.com/UMass-Embodied-AGI/CommVQ.
中文: 本文提出交换向量量化(CommVQ)方法,通过加法量化和与旋转位置编码兼容的码本压缩键值缓存,在保持精度的同时将GPU内存使用降低高达87.5%,使LLaMA-3.1 8B模型能在单张RTX 4090显卡上处理128K长文本。
English: This paper introduces Commutative Vector Quantization (CommVQ), a method that reduces GPU memory usage for long-context LLM inference by compressing the key-value cache with additive quantization and a RoPE-commutative codebook, achieving up to 87.5% size reduction with minimal accuracy loss.

Authors:Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
Title: OmniGen2: Exploration to Advanced Multimodal Generation
Abstract:
In this work, we introduce OmniGen2, a versatile and open-source generative model designed to provide a unified solution for diverse generation tasks, including text-to-image, image editing, and in-context generation. Unlike OmniGen v1, OmniGen2 features two distinct decoding pathways for text and image modalities, utilizing unshared parameters and a decoupled image tokenizer. This design enables OmniGen2 to build upon existing multimodal understanding models without the need to re-adapt VAE inputs, thereby preserving the original text generation capabilities. To facilitate the training of OmniGen2, we developed comprehensive data construction pipelines, encompassing image editing and in-context generation data. Additionally, we introduce a reflection mechanism tailored for image generation tasks and curate a dedicated reflection dataset based on OmniGen2. Despite its relatively modest parameter size, OmniGen2 achieves competitive results on multiple task benchmarks, including text-to-image and image editing. To further evaluate in-context generation, also referred to as subject-driven tasks, we introduce a new benchmark named OmniContext. OmniGen2 achieves state-of-the-art performance among open-source models in terms of consistency. We will release our models, training code, datasets, and data construction pipeline to support future research in this field. Project Page: https://vectorspacelab.github.io/OmniGen2; GitHub Link: https://github.com/VectorSpaceLab/OmniGen2
中文: OmniGen2 是一种开源生成模型,通过双解码路径和反射机制统一处理文本到图像、图像编辑及上下文生成任务,在保持文本生成能力的同时实现了领先性能。
English: OmniGen2 is an open-source generative model that unifies text-to-image, image editing, and in-context generation tasks through dual decoding pathways and a reflection mechanism, achieving state-of-the-art performance while preserving text generation capabilities.

Authors:Siao Tang, Xinyin Ma, Gongfan Fang, Xinchao Wang
Title: ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation
Abstract:
Recent advancements in large reasoning models (LRMs) like DeepSeek-R1 and the OpenAI o1 series have achieved notable performance enhancements on complex reasoning tasks by scaling up the length of Chain-of-Thought (CoT) generation. However, an emerging issue is their inclination to produce excessively verbose reasoning processes, leading to an inefficiency problem. Existing literature on improving efficiency mainly adheres to before-reasoning paradigms such as prompt-then-reason or fine-tune-then-reason, but ignores the promising direction of directly encouraging the model to speak concisely by intervening during the generation of reasoning. To fill this gap, we propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely by injecting a textual hint (manually designed or trained on concise data) during the token generation of the reasoning process. Besides, ConciseHint adapts to the complexity of the query by adjusting the hint intensity, which ensures it will not undermine model performance. Experiments on state-of-the-art LRMs, including the DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning processes while maintaining performance. For instance, we achieve a 65% reduction in reasoning length on the GSM8K benchmark with Qwen-3 4B with nearly no accuracy loss.
中文:提出的ConciseHint框架通过在推理生成过程中注入自适应提示,有效解决大型推理模型冗长问题,能在保持性能的同时生成简洁输出,并可无缝兼容现有方法。
English: The proposed ConciseHint framework addresses the verbosity of large reasoning models by injecting adaptive hints during reasoning generation, effectively producing concise outputs while maintaining performance and compatibility with existing methods.
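
The distinguishing idea, intervening during generation rather than before it, fits in a short decoding loop. In this sketch, step_fn is a stand-in for a real language-model decode step, and the fixed injection interval replaces the paper's adaptive hint-intensity schedule:

```python
def generate_with_hints(step_fn, prompt_tokens, hint_tokens,
                        interval=16, max_tokens=64):
    """Sketch of ConciseHint-style intervention *during* generation.

    Every `interval` generated tokens, a tokenized hint (e.g. "be concise")
    is injected into the context before decoding continues.
    `step_fn(tokens) -> next_token` stands in for a real LM decode step.
    """
    tokens = list(prompt_tokens)
    generated = 0
    while generated < max_tokens:
        if generated and generated % interval == 0:
            tokens.extend(hint_tokens)      # hint enters the context only
        nxt = step_fn(tokens)
        tokens.append(nxt)
        generated += 1
        if nxt == 0:                        # assume 0 is the EOS id
            break
    return tokens

# Dummy decode step: emits a counter, stops after 40 tokens.
state = {"n": 0}
def dummy_step(tokens):
    state["n"] += 1
    return 0 if state["n"] >= 40 else state["n"]

out = generate_with_hints(dummy_step, [101, 102], hint_tokens=[7, 8, 9])
print(len(out))
```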

Authors:Zhenru Lin, Jiawen Tao, Yang Yuan, Andrew Chi-Chih Yao
Title: Existing LLMs Are Not Self-Consistent For Simple Tasks
Abstract:
Large Language Models (LLMs) have grown increasingly powerful, yet ensuring their decisions remain transparent and trustworthy requires self-consistency -- no contradictions in their internal reasoning. Our study reveals that even on simple tasks, such as comparing points on a line or a plane, or reasoning in a family tree, all smaller models are highly inconsistent, and even state-of-the-art models like DeepSeek-R1 and GPT-o4-mini are not fully self-consistent. To quantify and mitigate these inconsistencies, we introduce inconsistency metrics and propose two automated methods -- a graph-based and an energy-based approach. While these fixes provide partial improvements, they also highlight the complexity and importance of self-consistency in building more reliable and interpretable AI. The code and data are available at https://github.com/scorpio-nova/llm-self-consistency.
中文摘要:大型语言模型即使在简单任务中也存在显著的自洽性问题,本研究提出的度量方法和缓解方案虽取得部分改进,但凸显了实现可靠人工智能推理的复杂性。
English Summary: Large Language Models exhibit significant self-consistency issues even in simple tasks, prompting the development of metrics and mitigation methods that partially address but underscore the complexity of achieving reliable AI reasoning.
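
A graph-based consistency check of the kind the abstract mentions can be illustrated on pairwise comparisons: if the model's "a < b" judgments form a cycle, they are contradictory. This toy detector shows the underlying idea, not the paper's exact method:

```python
def find_contradictions(comparisons):
    """Detect self-inconsistency in pairwise 'a < b' judgments via cycles.

    If an LLM claims A < B, B < C and C < A, the directed graph of its
    answers contains a cycle, i.e. an internal contradiction.
    """
    graph = {}
    for a, b in comparisons:                 # edge a -> b means "a < b"
        graph.setdefault(a, set()).add(b)

    def reachable(src, dst):
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(graph.get(node, ()))
        return False

    # An edge (a, b) is contradictory if b can already reach a.
    return [(a, b) for a, b in comparisons if reachable(b, a)]

answers = [("A", "B"), ("B", "C"), ("C", "A")]   # model's three judgments
print(find_contradictions(answers))              # every edge closes a cycle
```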

Authors:Chong Zhang, Xiang Li, Jia Wang, Shan Liang, Haochen Xue, Xiaobo Jin
Title: Semantic-Preserving Adversarial Attacks on LLMs: An Adaptive Greedy Binary Search Approach
Abstract:
Large Language Models (LLMs) increasingly rely on automatic prompt engineering in graphical user interfaces (GUIs) to refine user inputs and enhance response accuracy. However, the diversity of user requirements often leads to unintended misinterpretations, where automated optimizations distort original intentions and produce erroneous outputs. To address this challenge, we propose the Adaptive Greedy Binary Search (AGBS) method, which simulates common prompt optimization mechanisms while preserving semantic stability. Our approach dynamically evaluates the impact of such strategies on LLM performance, enabling robust adversarial sample generation. Through extensive experiments on open and closed-source LLMs, we demonstrate AGBS's effectiveness in balancing semantic consistency and attack efficacy. Our findings offer actionable insights for designing more reliable prompt optimization systems. Code is available at: https://github.com/franz-chang/DOBS
中文摘要:自适应贪婪二分搜索(AGBS)方法通过保持语义稳定性并评估优化策略对大型语言模型性能的影响,有效解决了自动提示工程中的误解问题,实验证明其在开源和闭源模型上均能平衡语义一致性与攻击效果。
English Summary: The Adaptive Greedy Binary Search (AGBS) method is introduced to mitigate misinterpretations in automatic prompt engineering by preserving semantic stability while evaluating optimization impacts on LLM performance, demonstrating effectiveness through experiments on various models.

Authors:Jie Li, Shifei Ding, Lili Guo, Xuan Li
Title: Multi-modal Anchor Gated Transformer with Knowledge Distillation for Emotion Recognition in Conversation
Abstract:
Emotion Recognition in Conversation (ERC) aims to detect the emotions of individual utterances within a conversation. Generating efficient and modality-specific representations for each utterance remains a significant challenge. Previous studies have proposed various models to integrate features extracted using different modality-specific encoders. However, they neglect the varying contributions of modalities to this task and introduce high complexity by aligning modalities at the frame level. To address these challenges, we propose the Multi-modal Anchor Gated Transformer with Knowledge Distillation (MAGTKD) for the ERC task. Specifically, prompt learning is employed to enhance textual modality representations, while knowledge distillation is utilized to strengthen representations of weaker modalities. Furthermore, we introduce a multi-modal anchor gated transformer to effectively integrate utterance-level representations across modalities. Extensive experiments on the IEMOCAP and MELD datasets demonstrate the effectiveness of knowledge distillation in enhancing modality representations and achieve state-of-the-art performance in emotion recognition. Our code is available at: https://github.com/JieLi-dd/MAGTKD.
Chinese: 提出的MAGTKD模型通过提示学习和知识蒸馏增强模态表示,并利用多模态锚点门控变换器有效整合各模态信息,在基准数据集上实现了对话中情感识别的最新性能。
English: The proposed MAGTKD model enhances emotion recognition in conversations by using prompt learning and knowledge distillation to improve modality representations and integrates them effectively with a multi-modal anchor gated transformer, achieving state-of-the-art results on benchmark datasets.
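
The knowledge-distillation ingredient is standard enough to sketch: the stronger (textual) modality acts as teacher for a weaker one. The temperature-scaled KL loss below is the common Hinton-style default, which may differ from the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Temperature-scaled knowledge-distillation loss.

    MAGTKD distills from the stronger (textual) modality into weaker ones;
    this KL-based loss is the common default one would start from.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

student = torch.randn(8, 6, requires_grad=True)   # e.g. audio-modality head
teacher = torch.randn(8, 6)                       # e.g. text-modality head
loss = distillation_loss(student, teacher)
loss.backward()
print(float(loss))
```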

Authors:Haoyi Wu, Zhihao Teng, Kewei Tu
Title: Parallel Continuous Chain-of-Thought with Jacobi Iteration
Abstract:
Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models. By reasoning with continuous latent thought tokens, continuous CoT is able to perform implicit reasoning in a compact manner. However, the sequential dependencies between latent thought tokens spoil parallel training, leading to long training time. In this paper, we propose Parallel Continuous Chain-of-Thought (PCCoT), which performs Jacobi iteration on the latent thought tokens, updating them iteratively in parallel instead of sequentially and thus improving both training and inference efficiency of continuous CoT. Experiments demonstrate that by choosing the proper number of iterations, we are able to achieve comparable or even better performance while saving nearly 50% of the training and inference time. Moreover, PCCoT shows better stability and robustness in the training process. Our code is available at https://github.com/whyNLP/PCCoT.
Chinese: 并行连续思维链(PCCoT)通过雅可比迭代并行更新潜在思维标记,在保持相当或更优性能的同时,将训练和推理时间减少近50%,并提升了稳定性和鲁棒性。
English: Parallel Continuous Chain-of-Thought (PCCoT) enhances efficiency by using Jacobi iteration to update latent thought tokens in parallel, achieving comparable or better performance while cutting training and inference time by nearly 50% and improving stability.
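
The Jacobi trick is worth seeing in miniature: a chain of latent updates z_t = f(z_{t-1}) can be recovered by sweeping over all positions in parallel and iterating, since each sweep propagates correct values one position further. The toy update below is arbitrary, not PCCoT's trained module:

```python
import torch

torch.manual_seed(0)
d, n = 16, 8                                  # latent dim, number of thought tokens
W = torch.randn(d, d) * 0.1                   # toy "thought update"

def step(prev_thought):
    return torch.tanh(prev_thought @ W)       # z_t = f(z_{t-1})

def sequential_cot(z_init):
    """Baseline: each latent token depends on the previous one (n serial steps)."""
    z = [z_init]
    for _ in range(n):
        z.append(step(z[-1]))
    return torch.stack(z[1:])

def jacobi_cot(z_init, n_iters=12):
    """PCCoT-style: update all n tokens in parallel, iterate to a fixed point.

    Information propagates one position per sweep, so n sweeps suffice
    for exact agreement with the sequential chain.
    """
    z = torch.zeros(n, d)
    for _ in range(n_iters):
        prev = torch.cat([z_init.unsqueeze(0), z[:-1]])  # shifted last iterate
        z = step(prev)                                   # one parallel sweep
    return z

z0 = torch.randn(d)
print(torch.allclose(sequential_cot(z0), jacobi_cot(z0), atol=1e-5))  # True
```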

Authors:Yuting Zhang, Kaishen Yuan, Hao Lu, Yutao Yue, Jintai Chen, Kaishun Wu
Title: MedTVT-R1: A Multimodal LLM Empowering Medical Reasoning and Diagnosis
Abstract:
Accurate and interpretable multi-disease diagnosis remains a critical challenge in medical research, particularly when leveraging heterogeneous multimodal medical data. Current approaches often rely on single-modal data, limiting their ability to comprehensively understand complex diseases. To address this, we propose MedTVT-R1, a novel Multimodal Large Language Model (MLLM) framework designed to integrate clinical multimodal data for reasoning and diagnosing multiple diseases. We construct MedTVT-QA, a curated instruction dataset that provides question-answer pairs for physiological-level interpretations and disease-level diagnoses with a Chain of Evidence approach. MedTVT-R1 incorporates a modality perception layer to capture inter-modal dependencies and adaptively weight modality contributions. Additionally, we employ Group Relative Policy Optimization (GRPO)-based Reinforcement Fine-Tuning with a Jaccard Reward function to enhance diagnostic reasoning. Experimental results demonstrate MedTVT-R1's superiority in multimodal feature utilization and multi-disease diagnosis, offering significant potential for clinical applications such as diagnostic report generation and comorbidity reasoning. The dataset and code are available at https://github.com/keke-nice/MedTVT-R1.
中文:MedTVT-R1是一种新型多模态大语言模型,通过整合临床数据和采用强化微调技术,实现了精准的多疾病诊断与可解释推理,在医疗应用中展现出卓越性能。
English: MedTVT-R1 is a novel multimodal large language model that integrates clinical data for accurate multi-disease diagnosis and interpretable reasoning, demonstrating superior performance through advanced modality fusion and reinforcement fine-tuning.
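
The Jaccard reward named in the abstract is simple to state exactly; how it is wired into GRPO is the paper's contribution, so the snippet below shows only the reward itself:

```python
def jaccard_reward(predicted, gold):
    """Jaccard reward over disease-label sets.

    Multi-disease diagnosis predicts a *set* of labels, so an overlap-based
    reward is natural: |P intersect G| / |P union G|, in [0, 1].
    """
    p, g = set(predicted), set(gold)
    return len(p & g) / len(p | g) if p | g else 1.0

print(jaccard_reward(["CAD", "CHF"], ["CAD", "CHF", "AFib"]))  # 0.666...
```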

Authors:Markus Frohmann, Elena V. Epure, Gabriel Meseguer-Brocal, Markus Schedl, Romain Hennequin
Title: AI-Generated Song Detection via Lyrics Transcripts
Abstract:
The recent rise in the capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating the creation of accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. However, in practice, such perfect lyrics are not available (only the audio is); this leaves a substantial gap in applicability in real-life use cases. In this work, we instead propose to bridge this gap by transcribing songs using general automatic speech recognition (ASR) models. We do this using several detectors. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
中文: 本研究通过使用自动语音识别转录歌曲并采用多种检测器,提出了一种强大的AI生成音乐检测方法,在多语言和多流派场景下表现优异,且在处理失真音频和不同音乐生成器时优于现有音频检测技术。
English: This study addresses the limitations of existing AI-generated music detection methods by proposing a robust approach that transcribes songs using automatic speech recognition and employs multiple detectors, demonstrating strong performance across languages and genres while outperforming audio-based methods when handling perturbed audio and diverse music generators.
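
A rough approximation of the transcribe-then-classify pipeline can be assembled from public libraries. Note the paper's best system uses LLM2Vec embeddings; the sentence-transformers encoder below is a substitute, so treat this as a sketch of the pipeline shape, not the authors' implementation:

```python
# pip install openai-whisper sentence-transformers scikit-learn
import whisper
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

asr = whisper.load_model("large-v2")                      # ASR as in the paper
encoder = SentenceTransformer("all-MiniLM-L6-v2")         # stand-in for LLM2Vec

def lyrics_features(audio_paths):
    # Transcribe each song to (noisy) lyrics, then embed the text.
    lyrics = [asr.transcribe(p)["text"] for p in audio_paths]
    return encoder.encode(lyrics)

# train_paths / labels: human- vs AI-generated songs (1 = AI-generated).
# clf = LogisticRegression(max_iter=1000).fit(lyrics_features(train_paths), labels)
# print(clf.predict(lyrics_features(test_paths)))
```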

Authors:Lixin Wu, Na Cai, Qiao Cheng, Jiachen Wang, Yitao Duan
Title: Confucius3-Math: A Lightweight High-Performance Reasoning LLM for Chinese K-12 Mathematics Learning
Abstract:
We introduce Confucius3-Math, an open-source large language model with 14B parameters that (1) runs efficiently on a single consumer-grade GPU; (2) achieves SOTA performance on a range of mathematical reasoning tasks, outperforming many models with significantly larger sizes. In particular, as part of our mission to enhance education and knowledge dissemination with AI, Confucius3-Math is specifically committed to mathematics learning for Chinese K-12 students and educators. Built via post-training with large-scale reinforcement learning (RL), Confucius3-Math aligns with the national curriculum and excels at solving mainstream Chinese K-12 mathematical problems at low cost. In this report we share our development recipe, the challenges we encountered, and the techniques we developed to overcome them. In particular, we introduce three technical innovations: Targeted Entropy Regularization, Recent Sample Recovery, and Policy-Specific Hardness Weighting. These innovations encompass a new entropy regularization, a novel data scheduling policy, and an improved group-relative advantage estimator. Collectively, they significantly stabilize the RL training, improve data efficiency, and boost performance. Our work demonstrates the feasibility of building strong reasoning models for a particular domain at low cost. We open-source our model and code at https://github.com/netease-youdao/Confucius3-Math.
Chinese: Confucius3-Math是一个开源140亿参数模型,可在消费级GPU上高效运行,在数学推理任务中达到顶尖水平,并针对中国K-12数学教育通过强化学习技术创新实现了低成本高性能。
English: Confucius3-Math is an open-source 14B-parameter model that efficiently runs on consumer GPUs and achieves state-of-the-art performance in mathematical reasoning, specifically tailored for Chinese K-12 education with innovations in reinforcement learning training.

Authors:Tianyu Yu, Bo Ji, Shouli Wang, Shu Yao, Zefan Wang, Ganqu Cui, Lifan Yuan, Ning Ding, Yuan Yao, Zhiyuan Liu, Maosong Sun, Tat-Seng Chua
Title: RLPR: Extrapolating RLVR to General Domains without Verifiers
Abstract:
Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates promising potential in advancing the reasoning capabilities of LLMs. However, its success remains largely confined to mathematical and code domains. This primary limitation stems from the heavy reliance on domain-specific verifiers, which results in prohibitive complexity and limited scalability. To address the challenge, our key observation is that LLM's intrinsic probability of generating a correct free-form answer directly indicates its own evaluation of the reasoning reward (i.e., how well the reasoning process leads to the correct answer). Building on this insight, we propose RLPR, a simple verifier-free framework that extrapolates RLVR to broader general domains. RLPR uses the LLM's own token probability scores for reference answers as the reward signal and maximizes the expected reward during training. We find that addressing the high variance of this noisy probability reward is crucial to make it work, and propose prob-to-reward and stabilizing methods to ensure a precise and stable reward from LLM intrinsic probabilities. Comprehensive experiments in four general-domain benchmarks and three mathematical benchmarks show that RLPR consistently improves reasoning capabilities in both areas for Gemma, Llama, and Qwen based models. Notably, RLPR outperforms concurrent VeriFree by 7.6 points on TheoremQA and 7.5 points on Minerva, and even surpasses strong verifier-model-dependent approaches General-Reasoner by 1.6 average points across seven benchmarks.
Chinese: RLPR是一种无需验证器的强化学习框架,它利用大语言模型自身对参考答案的标记概率作为奖励信号,有效提升了在通用领域和数学领域的推理能力,并超越了现有方法的性能。
English: RLPR is a verifier-free reinforcement learning framework that uses an LLM's own token probability scores for reference answers as the reward signal, effectively enhancing reasoning capabilities across both general and mathematical domains while outperforming existing methods.
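
The raw reward signal, the probability the policy itself assigns to the reference answer, is a short computation on top of logits. The paper's prob-to-reward debiasing and variance stabilization are omitted here:

```python
import torch
import torch.nn.functional as F

def probability_reward(logits, reference_ids):
    """RLPR-style verifier-free reward from the model's own probabilities.

    Uses the mean per-token probability the policy assigns to the reference
    answer tokens as the reward signal; the paper adds debiasing and
    variance-stabilization on top of this raw quantity.
    """
    log_probs = F.log_softmax(logits, dim=-1)                       # (T, V)
    token_logp = log_probs.gather(-1, reference_ids.unsqueeze(-1))  # (T, 1)
    return token_logp.exp().mean().item()   # mean token probability in [0, 1]

torch.manual_seed(0)
logits = torch.randn(5, 100)          # 5 reference tokens, vocab of 100
reference_ids = torch.tensor([3, 17, 42, 8, 99])
print(probability_reward(logits, reference_ids))
```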

Authors:Yicheng Fu, Zhemin Huang, Liuxin Yang, Yumeng Lu, Zhongdongming Dai
Title: Chengyu-Bench: Benchmarking Large Language Models for Chinese Idiom Understanding and Use
Abstract:
Chinese idioms (Chengyu) are concise four-character expressions steeped in history and culture, whose literal translations often fail to capture their full meaning. This complexity makes them challenging for language models to interpret and use correctly. Existing benchmarks focus on narrow tasks - multiple-choice cloze tests, isolated translation, or simple paraphrasing. We introduce Chengyu-Bench, a comprehensive benchmark featuring three tasks: (1) Evaluative Connotation, classifying idioms as positive or negative; (2) Appropriateness, detecting incorrect idiom usage in context; and (3) Open Cloze, filling blanks in longer passages without options. Chengyu-Bench comprises 2,937 human-verified examples covering 1,765 common idioms sourced from diverse corpora. We evaluate leading LLMs and find they achieve over 95% accuracy on Evaluative Connotation, but only ~85% on Appropriateness and ~40% top-1 accuracy on Open Cloze. Error analysis reveals that most mistakes arise from fundamental misunderstandings of idiom meanings. Chengyu-Bench demonstrates that while LLMs can reliably gauge idiom sentiment, they still struggle to grasp the cultural and contextual nuances essential for proper usage. The benchmark and source code are available at: https://github.com/sofyc/ChengyuBench.
Chinese Summary: Chengyu-Bench是一个全面评估语言模型对中文成语理解能力的基准测试,包含三项任务,结果表明模型虽能准确判断成语情感倾向,但在语境运用和语义理解方面仍有明显不足。
English Summary: Chengyu-Bench is a comprehensive benchmark designed to evaluate language models' understanding of Chinese idioms through three tasks, revealing that while models excel at identifying sentiment, they struggle with contextual usage and meaning comprehension.

Authors:Fuyu Wang, Jiangtong Li, Kun Zhu, Changjun Jiang
Title: InspireDebate: Multi-Dimensional Subjective-Objective Evaluation-Guided Reasoning and Optimization for Debating
Abstract:
With the rapid advancements in large language models (LLMs), debating tasks, such as argument quality assessment and debate process simulation, have made significant progress. However, existing LLM-based debating systems focus on responding to specific arguments while neglecting objective assessments such as authenticity and logical validity. Furthermore, these systems lack a structured approach to optimize across various dimensions (including evaluation metrics, chain-of-thought (CoT) reasoning, and multi-turn debate refinement), thereby limiting their effectiveness. To address these interconnected challenges, we propose a dual-component framework: (1) InspireScore, a novel evaluation system that establishes a multi-dimensional assessment architecture incorporating four subjective criteria (emotional appeal, argument clarity, argument arrangement, and topic relevance) alongside two objective metrics (fact authenticity and logical validity); and (2) InspireDebate, an optimized debating framework employing a phased optimization approach through CoT reasoning enhancement, multi-dimensional Direct Preference Optimization (DPO), and real-time knowledge grounding via web-based Retrieval Augmented Generation (Web-RAG). Empirical evaluations demonstrate that InspireScore achieves 44% higher correlation with expert judgments compared to existing methods, while InspireDebate shows significant improvements, outperforming baseline models by 57%. Source code is available at https://github.com/fywang12/InspireDebate.
中文摘要:本文提出包含InspireScore评估系统和InspireDebate辩论框架的双组件方案,通过建立多维度评估架构和分阶段优化方法,有效解决了现有大语言模型辩论系统在客观评估和结构化优化方面的不足,实验表明其与专家判断相关性提升44%,性能较基线模型提高57%。
English Summary: This paper introduces a dual-component framework, InspireScore and InspireDebate, to address limitations in current LLM-based debating systems by implementing multi-dimensional evaluation metrics and phased optimization techniques, achieving significantly higher correlation with expert judgments and performance improvements over baseline models.

Authors:Tianxing Chen, Zanxin Chen, Baijun Chen, Zijian Cai, Yibin Liu, Zixuan Li, Qiwei Liang, Xianliang Lin, Yiheng Ge, Zhenyu Gu, Weiliang Deng, Yubin Guo, Tian Nian, Xuanbing Xie, Qiangyu Chen, Kailun Su, Tianling Xu, Guodong Liu, Mengkang Hu, Huan-ang Gao, Kaixuan Wang, Zhixuan Liang, Yusen Qin, Xiaokang Yang, Ping Luo, Yao Mu
Title: RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation
Abstract:
Simulation-based data synthesis has emerged as a powerful paradigm for advancing real-world robotic manipulation. Yet existing datasets remain insufficient for robust bimanual manipulation due to (1) the lack of scalable task generation methods and (2) oversimplified simulation environments. We present RoboTwin 2.0, a scalable framework for automated, large-scale generation of diverse and realistic data, together with unified evaluation protocols for dual-arm manipulation. At its core is RoboTwin-OD, an object library of 731 instances across 147 categories with semantic and manipulation-relevant annotations. Building on this, we design an expert data synthesis pipeline that leverages multimodal language models (MLLMs) and simulation-in-the-loop refinement to automatically generate task-level execution code. To improve sim-to-real transfer, RoboTwin 2.0 applies structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language, enhancing data diversity and policy robustness. The framework is instantiated across 50 dual-arm tasks and five robot embodiments. Empirically, it yields a 10.9% gain in code generation success rate. For downstream policy learning, a VLA model trained with synthetic data plus only 10 real demonstrations achieves a 367% relative improvement over the 10-demo baseline, while zero-shot models trained solely on synthetic data obtain a 228% gain. These results highlight the effectiveness of RoboTwin 2.0 in strengthening sim-to-real transfer and robustness to environmental variations. We release the data generator, benchmark, dataset, and code to support scalable research in robust bimanual manipulation. Project Page: https://robotwin-platform.github.io/, Code: https://github.com/robotwin-Platform/robotwin/.
中文:RoboTwin 2.0通过自动化任务合成和领域随机化提出了可扩展的双臂操作数据生成框架,在仿真到实物的迁移和政策鲁棒性方面实现了显著提升。
English: RoboTwin 2.0 introduces a scalable framework for generating diverse bimanual manipulation data through automated task synthesis and domain randomization, achieving significant improvements in sim-to-real transfer and policy robustness.
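
Structured domain randomization along the five named axes reduces to sampling one value per axis for each scene. The ranges and categorical choices below are invented placeholders; only the axis names come from the abstract:

```python
import random

# Randomization axes from the RoboTwin 2.0 abstract; ranges are illustrative.
AXES = {
    "clutter":         lambda rng: rng.randint(0, 12),           # distractor objects
    "lighting":        lambda rng: rng.uniform(0.3, 1.5),        # intensity scale
    "background":      lambda rng: rng.choice(["wood", "marble", "lab", "kitchen"]),
    "tabletop_height": lambda rng: rng.uniform(0.70, 0.95),      # meters
    "language":        lambda rng: rng.choice(["terse", "verbose", "paraphrased"]),
}

def sample_scene(seed=None):
    """Draw one randomized scene configuration (one value per axis)."""
    rng = random.Random(seed)
    return {axis: sampler(rng) for axis, sampler in AXES.items()}

for i in range(3):
    print(sample_scene(seed=i))   # three randomized scene configurations
```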

Authors:Kui Huang, Xinrong Chen, Wenyu Lv, Jincheng Liao, Guanzhong Wang, Yi Liu
Title: PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
Abstract:
This report introduces PP-DocBee2, an advanced version of PP-DocBee designed to enhance multimodal document understanding. Built on a large multimodal model architecture, PP-DocBee2 addresses the limitations of its predecessor through key technological improvements, including enhanced synthetic data quality, an improved visual feature fusion strategy, and optimized inference methodologies. These enhancements yield an 11.4% performance boost on internal benchmarks for Chinese business documents and reduce inference latency by 73.0% relative to the vanilla version. A key innovation of our work is a data quality optimization strategy for multimodal document tasks. By employing a large-scale multimodal pre-trained model to evaluate data, we apply a novel statistical criterion to filter outliers, ensuring high-quality training data. Inspired by insights into underutilized intermediate features in multimodal models, we enhance the representational capacity of the ViT by decomposing it into layers and applying a novel feature fusion strategy to improve complex reasoning. The source code and pre-trained model are available at https://github.com/PaddlePaddle/PaddleMIX.
Chinese: PP-DocBee2通过优化数据质量和视觉特征融合策略,在中文商务文档理解任务中性能提升11.4%,推理延迟降低73.0%,显著提升了多模态文档理解能力。
English: PP-DocBee2 significantly improves multimodal document understanding with enhanced data quality and feature fusion, boosting performance by 11.4% and reducing latency by 73.0% compared to its predecessor.
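
The data-quality step, scoring samples with a strong model and dropping statistical outliers, can be mimicked with a plain z-score cut-off. The abstract does not specify the actual criterion, so this filter is a hedged stand-in:

```python
import numpy as np

def filter_outliers(samples, scores, z_max=2.0):
    """Score-based training-data filtering in the spirit of PP-DocBee2.

    `scores` would come from a large multimodal model's quality judgments;
    samples whose score sits more than z_max standard deviations from the
    mean are dropped.
    """
    s = np.asarray(scores, dtype=float)
    z = (s - s.mean()) / (s.std() + 1e-8)
    return [x for x, zi in zip(samples, z) if abs(zi) <= z_max]

data = [f"doc_{i}" for i in range(8)]
quality = [0.81, 0.79, 0.83, 0.80, 0.02, 0.82, 0.78, 0.84]  # one corrupted sample
print(filter_outliers(data, quality))  # doc_4 removed
```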

Authors:Quanwei Tang, Sophia Yat Mei Lee, Junshuang Wu, Dong Zhang, Shoushan Li, Erik Cambria, Guodong Zhou
Title: A Comprehensive Graph Framework for Question Answering with Mode-Seeking Preference Alignment
Abstract:
Recent advancements in retrieval-augmented generation (RAG) have enhanced large language models in question answering by integrating external knowledge. However, challenges persist in achieving global understanding and aligning responses with human ethical and quality preferences. To address these issues, we propose GraphMPA, a comprehensive graph-based framework with mode-seeking preference alignment. Our approach constructs a hierarchical document graph using a general similarity measurement, mimicking human cognitive processes for information understanding and synthesis. Additionally, we introduce mode-seeking preference optimization to better align model outputs with human preferences through probability-matching constraints. Extensive experiments on six datasets demonstrate the effectiveness of GraphMPA. Code is available at https://github.com/tangquanwei/GraphMPA.
中文:GraphMPA框架通过构建分层文档图和采用模式寻求偏好优化,有效提升了检索增强生成的性能,使模型输出更符合人类认知与伦理标准。
English: The proposed GraphMPA framework enhances retrieval-augmented generation by constructing hierarchical document graphs and employing mode-seeking preference optimization to better align model outputs with human cognitive and ethical standards.
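
The first stage, a document graph built from a general similarity measurement, is easy to sketch with cosine similarity and a threshold; the hierarchical summary nodes GraphMPA builds on top are omitted:

```python
import numpy as np

def build_document_graph(embeddings, threshold=0.7):
    """Similarity-threshold document graph (first step toward a hierarchy).

    Edges connect chunks whose cosine similarity exceeds the threshold;
    higher-level summary nodes would then be built over connected regions.
    """
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                                   # pairwise cosine similarity
    n = len(X)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sim[i, j] >= threshold]

# Three chunk embeddings: the first two are near-duplicates.
chunks = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0]]
print(build_document_graph(chunks))   # [(0, 1)]
```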

Authors:Jianyu Wang, Zhiqiang Hu, Lidong Bing
Title: Evolving Prompts In-Context: An Open-ended, Self-replicating Perspective
Abstract:
We propose a novel prompt design paradigm that challenges conventional wisdom in large language model (LLM) prompting. While conventional wisdom prioritizes well-crafted instructions and demonstrations for in-context learning (ICL), we show that pruning random demonstrations into seemingly incoherent "gibberish" can remarkably improve performance across diverse tasks. Notably, the "gibberish" always matches or surpasses state-of-the-art automatic prompt optimization techniques, achieving substantial gains regardless of LLM alignment. Nevertheless, discovering an effective pruning strategy is non-trivial, as existing attribution methods and prompt compression algorithms fail to deliver robust results, let alone human intuition. To this end, we propose PromptQuine, a self-discovering, evolutionary search framework that automatically finds the pruning strategy by itself using only low-data regimes. Much like the emergent complexity in nature--such as symbiosis and self-organization--arising in response to resource constraints, our framework evolves and refines unconventional yet highly effective prompts by leveraging only the tokens present within the context. We demonstrate its effectiveness across classification, multi-choice question answering, generation, and math reasoning tasks across LLMs, while achieving decent runtime efficiency. We hope our findings can guide mechanistic studies on in-context learning and provide a call to action to pave the way for more open-ended search algorithms for more effective LLM prompting.
中文: 本研究提出了一种新颖的提示设计范式,证明将随机示例修剪为看似不连贯的"乱码"能在多种任务中超越传统提示方法和最优自动优化技术,同时开发了PromptQuine进化框架,仅需少量数据即可自主发现有效的修剪策略。
English: This study introduces a novel prompt design paradigm that demonstrates pruning random demonstrations into seemingly incoherent "gibberish" can outperform conventional prompting methods and state-of-the-art optimization techniques across various tasks, while also proposing PromptQuine, an evolutionary framework that autonomously discovers effective pruning strategies with minimal data.
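
The search loop, evolving which in-context tokens to keep under a task fitness, is the sketchable part. Everything below (mutation rate, truncation selection, the toy fitness) illustrates the shape of the search rather than PromptQuine's actual operators:

```python
import random

def evolve_pruning_mask(tokens, fitness, generations=30, pop=16, seed=0):
    """Toy evolutionary search for a demonstration-pruning mask.

    Evolves binary keep/drop masks over the in-context tokens against a
    user-supplied `fitness` (e.g. dev-set accuracy of the pruned prompt),
    using truncation selection plus bit-flip mutation.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in tokens] for _ in range(pop)]
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop // 2]                            # truncation selection
        children = [[bit ^ (rng.random() < 0.05) for bit in p]  # bit-flip mutation
                    for p in parents]
        population = parents + children
    best = max(population, key=fitness)
    return [t for t, keep in zip(tokens, best) if keep]

# Toy fitness: pretend only the tokens in CUES help the downstream task,
# and shorter prompts are slightly preferred.
CUES = {"positive", "negative", "review:"}
toks = "this movie review: text is positive or negative filler words".split()
fit = lambda mask: (sum(1 for t, k in zip(toks, mask) if k and t in CUES)
                    - 0.1 * sum(mask))
print(evolve_pruning_mask(toks, fit))
```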

Authors:Hua Tang, Lingyong Yan, Yukun Zhao, Shuaiqiang Wang, Jizhou Huang, Dawei Yin
Title: Multi-turn Jailbreaking via Global Refinement and Active Fabrication
Abstract:
Large Language Models (LLMs) have achieved exceptional performance across a wide range of tasks. However, they still pose significant safety risks due to the potential for misuse for malicious purposes. Jailbreaks, which aim to elicit models to generate harmful content, play a critical role in identifying the underlying security threats. Recent jailbreaking research primarily focuses on single-turn scenarios, while the more complicated multi-turn scenarios remain underexplored. Moreover, existing multi-turn jailbreaking techniques struggle to adapt to the evolving dynamics of dialogue as the interaction progresses. To address this limitation, we propose a novel multi-turn jailbreaking method that refines the jailbreaking path globally at each interaction. We also actively fabricate model responses to suppress safety-related warnings, thereby increasing the likelihood of eliciting harmful outputs in subsequent questions. Experimental results demonstrate the superior performance of our method compared with existing single-turn and multi-turn jailbreaking techniques across six state-of-the-art LLMs. Our code is publicly available at https://github.com/Ytang520/Multi-Turn_jailbreaking_Global-Refinment_and_Active-Fabrication.
Chinese: 提出的GRAF方法通过全局优化攻击路径和自适应伪造模型响应来规避安全机制,在六种先进大语言模型上的实验表明其多轮越狱效果显著优于现有方法。
English: The proposed GRAF method enhances multi-turn jailbreaking by globally refining attack strategies and adaptively fabricating responses to bypass safety measures, demonstrating superior effectiveness across six advanced LLMs compared to existing techniques.

Authors:Chenghao Yang, Ari Holtzman
Title: How Alignment Shrinks the Generative Horizon
Abstract:
Despite their impressive capabilities, aligned large language models (LLMs) often generate outputs that lack diversity. What drives this stability in the generation? We investigate this phenomenon through the lens of probability concentration in the model's output distribution. To quantify this concentration, we introduce the Branching Factor (BF) -- a token-invariant measure of the effective number of plausible next steps during generation. Our empirical analysis reveals two key findings: (1) BF often decreases as generation progresses, suggesting that LLMs become more predictable as they generate. (2) alignment tuning substantially sharpens the model's output distribution from the outset, reducing BF by nearly an order of magnitude (e.g., from 12 to 1.2) relative to base models. This stark reduction helps explain why aligned models often appear less sensitive to decoding strategies. Building on this insight, we find this stability has surprising implications for complex reasoning. Aligned Chain-of-Thought (CoT) models (e.g., DeepSeek-distilled models), for instance, leverage this effect; by generating longer reasoning chains, they push generation into later, more deterministic (lower BF) stages, resulting in more stable outputs. We hypothesize that alignment tuning does not fundamentally change a model's behavior, but instead steers it toward stylistic tokens (e.g., "Sure") that unlock low-entropy trajectories already present in the base model. This view is supported by nudging experiments, which show that prompting base models with such tokens can similarly reduce BF. Together, our findings establish BF as a powerful diagnostic for understanding and controlling LLM outputs - clarifying how alignment reduces variability, how CoT promotes stable generations, and how base models can be steered away from diversity.
Chinese: 对齐后的大型语言模型输出多样性降低源于概率集中现象,通过分支因子量化发现生成过程中分支因子递减且对齐调优使其大幅降低,这解释了模型稳定性及对解码策略不敏感的原因,并揭示了思维链模型通过延长推理进入确定性阶段实现稳定输出的机制。
English: Aligned large language models exhibit reduced output diversity due to probability concentration, quantified by the Branching Factor which decreases during generation and is substantially lowered by alignment tuning, explaining their stability and insensitivity to decoding strategies.
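
A natural, token-invariant reading of the Branching Factor is the perplexity of the next-token distribution, exp(H(p)); whether this matches the paper's exact estimator is an assumption, but it reproduces the reported qualitative gap between flat and sharpened distributions:

```python
import torch
import torch.nn.functional as F

def branching_factor(logits):
    """Effective number of plausible next tokens as exp of the entropy.

    A uniform distribution over k tokens gives BF = k; a near-deterministic
    distribution gives BF close to 1, mirroring the base-vs-aligned gap
    described in the abstract.
    """
    log_p = F.log_softmax(logits, dim=-1)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)
    return entropy.exp()

flat = torch.zeros(12)                    # 12 equally plausible next tokens
sharp = torch.tensor([8.0] + [0.0] * 11)  # one dominant continuation
print(branching_factor(flat))   # ~12  (base-model-like)
print(branching_factor(sharp))  # ~1   (aligned-model-like)
```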

Authors:Shahab Rahimirad, Guven Gergerli, Lucia Romero, Angela Qian, Matthew Lyle Olson, Simon Stepputtis, Joseph Campbell
Title: Bayesian Social Deduction with Graph-Informed Language Models
Abstract:
Social reasoning - inferring unobservable beliefs and intentions from partial observations of other agents - remains a challenging task for large language models (LLMs). We evaluate the limits of current reasoning language models in the social deduction game Avalon and find that while the largest models demonstrate strong performance, they require extensive test-time inference and degrade sharply when distilled to smaller, real-time-capable variants. To address this, we introduce a hybrid reasoning framework that externalizes belief inference to a structured probabilistic model, while using an LLM for language understanding and interaction. Our approach achieves competitive performance with much larger models in Agent-Agent play and, notably, is the first language agent to defeat human players in a controlled study - achieving a 67% win rate and receiving higher qualitative ratings than both reasoning baselines and human teammates. We release code, models, and a dataset to support future work on social reasoning in LLM agents, which can be found at https://camp-lab-purdue.github.io/bayesian-social-deduction/
Chinese Summary: 该研究提出了一种混合推理框架,将大型语言模型与结构化概率推理相结合,在社交推理游戏《阿瓦隆》中不仅取得了与更大模型相媲美的表现,还首次在受控研究中击败了人类玩家。
English Summary: The study introduces a hybrid reasoning framework that combines large language models with structured probabilistic inference, achieving competitive performance and even surpassing human players in the social deduction game Avalon.
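
For Avalon-sized games, externalizing belief inference can be done by exact enumeration: score every candidate spy set against the observations and normalize. The hand-written likelihood below stands in for the paper's graph-informed model:

```python
from itertools import combinations

def posterior_over_spies(players, n_spies, likelihood, observations):
    """Exact Bayesian belief update over hidden role assignments.

    Enumerates every possible spy set (tractable at Avalon scale), weights
    each by the likelihood of the observed quests, and normalizes.
    """
    hypotheses = list(combinations(players, n_spies))
    weights = []
    for spies in hypotheses:
        w = 1.0
        for obs in observations:
            w *= likelihood(obs, set(spies))
        weights.append(w)
    total = sum(weights)
    return {h: w / total for h, w in zip(hypotheses, weights)}

# Toy likelihood: a failed quest is much likelier if a spy was on the team.
def likelihood(obs, spies):
    team, failed = obs
    p_fail = 0.8 if set(team) & spies else 0.05
    return p_fail if failed else 1 - p_fail

players = ["A", "B", "C", "D", "E"]
obs = [(("A", "B"), True), (("B", "C"), False), (("C", "D"), False)]
post = posterior_over_spies(players, 2, likelihood, obs)
print(max(post, key=post.get))   # most probable spy pair: ('A', 'E')
```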

Authors:Fabien Furfaro
Title: TPTT: Transforming Pretrained Transformers into Titans
Abstract:
Transformer-based large language models (LLMs) have achieved strong performance across many natural language processing tasks. Nonetheless, their quadratic computational and memory requirements, particularly in self-attention layers, pose challenges for efficient inference on long contexts and for deployment in resource-limited environments. We present TPTT (Transforming Pretrained Transformers into Titans), a framework designed to augment pretrained Transformers with linearized attention (LiZA) and internal memory gating via Memory as Gate (MaG), applied without full retraining. TPTT supports parameter-efficient fine-tuning (LoRA) and integrates with standard toolkits such as Hugging Face Transformers. We evaluated TPTT on several pretrained models, including Llama-1B, OlMoE-1B-7B, Qwen2.5-1.5B, Gemma3-270m, OpenELM-1.3B, and Mistral-7B, in order to assess applicability across architectures of different scales. Experiments on models with approximately 1 billion parameters, evaluated primarily on the MMLU benchmark, suggest potential improvements in both efficiency and accuracy compared to baseline models. For example, Titans-Llama-1B exhibited up to a 20% relative increase in Exact Match scores in one-shot evaluation. An additional finding is that it is possible to convert a quadratic-attention model into a purely linear-attention model using the DeltaProduct mechanism. All training runs were carried out with modest computational resources. These preliminary findings indicate that TPTT may help adapt pretrained LLMs for long-context tasks with limited overhead. Further studies on larger models and a broader set of benchmarks will be necessary to evaluate the generality and robustness of the framework. Code is available at https://github.com/fabienfrfr/tptt, and the Python package at https://pypi.org/project/tptt/.
中文:TPTT框架通过线性化注意力和内存门控增强预训练Transformer,可在有限资源下高效适应长文本任务,同时保持或提升模型性能。
English: The TPTT framework enhances pretrained Transformers with linearized attention and memory gating, enabling efficient adaptation for long-context tasks while maintaining or improving performance with minimal computational resources.
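
Linearized attention, the LiZA half of TPTT, rests on the standard kernel trick that reorders the attention computation from O(T^2 d) to O(T d^2). The sketch below is the generic non-causal version with the elu(x)+1 feature map, without TPTT's memory gating:

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    """Kernelized linear attention of the kind LiZA-style modules build on.

    With feature map phi(x) = elu(x) + 1, attention becomes
    phi(Q) (phi(K)^T V) / (phi(Q) (phi(K)^T 1)), so the T x T attention
    matrix is never materialized.
    """
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    q, k = phi(q), phi(k)                       # (B, T, d)
    kv = torch.einsum("btd,bte->bde", k, v)     # sum_t phi(k_t) v_t^T
    z = k.sum(dim=1)                            # sum_t phi(k_t)
    num = torch.einsum("btd,bde->bte", q, kv)
    den = torch.einsum("btd,bd->bt", q, z).unsqueeze(-1) + eps
    return num / den

q = k = v = torch.randn(2, 128, 32)
print(linear_attention(q, k, v).shape)          # torch.Size([2, 128, 32])
```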

Authors:Yang Wu, Yifan Zhang, Yurong Wu, Yuran Wang, Junkai Zhang, Jian Cheng
Title: Step-Opt: Boosting Optimization Modeling in LLMs through Iterative Data Synthesis and Structured Validation
Abstract:
Large Language Models (LLMs) have revolutionized various domains but encounter substantial challenges in tackling optimization modeling tasks for Operations Research (OR), particularly when dealing with complex problems. In this work, we propose Step-Opt-Instruct, a framework that augments existing datasets and generates high-quality fine-tuning data tailored to optimization modeling. Step-Opt-Instruct employs iterative problem generation to systematically increase problem complexity and stepwise validation to rigorously verify data, preventing error propagation and ensuring the quality of the generated dataset. Leveraging this framework, we fine-tune open-source LLMs, including LLaMA-3-8B and Mistral-7B, to develop Step-Opt, a model that achieves state-of-the-art performance on benchmarks such as NL4OPT, MAMO, and IndustryOR. Extensive experiments demonstrate the superior performance of Step-Opt, especially in addressing complex OR tasks, with a notable 17.01% improvement in micro average accuracy on difficult problems. These findings highlight the effectiveness of combining structured validation with gradual problem refinement to advance the automation of decision-making processes using LLMs. The code and dataset are available at https://github.com/samwu-learn/Step.
中文摘要:本文提出Step-Opt-Instruct框架,通过迭代式问题生成和逐步验证机制为运筹学优化建模生成高质量训练数据,基于此开发的Step-Opt模型在多项基准测试中实现最优性能,尤其在复杂问题上取得17.01%的显著准确率提升。
English Summary: This paper introduces Step-Opt-Instruct, a framework that enhances optimization modeling for Operations Research by generating high-quality training data through iterative complexity escalation and stepwise validation, resulting in the Step-Opt model which achieves state-of-the-art performance with significant accuracy improvements on complex tasks.
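The generate-then-verify loop at the heart of Step-Opt-Instruct can be summarized in a few lines. A minimal sketch follows, assuming hypothetical `complicate` (an LLM-based problem rewriter) and `validate` (a structured checker) callables; only validated variants seed the next round, which is what prevents error propagation.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    text: str
    solution: str
    complexity: int = 1

def evolve_dataset(seed_problems, complicate, validate, rounds=3):
    """Iteratively raise problem complexity; keep only variants that pass
    stepwise validation so errors never propagate into later rounds."""
    dataset = list(seed_problems)
    frontier = list(seed_problems)
    for _ in range(rounds):
        survivors = []
        for p in frontier:
            candidate = complicate(p)          # e.g., add a constraint or variable
            if validate(candidate):            # reject before it can seed round t+1
                candidate.complexity = p.complexity + 1
                dataset.append(candidate)
                survivors.append(candidate)
        frontier = survivors
    return dataset
```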

Authors:Kailing Li, Qi'ao Xu, Tianwen Qian, Yuqian Fu, Yang Jiao, Xiaoling Wang
Title: CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
Abstract:
Embodied Visual Reasoning (EVR) seeks to follow complex, free-form instructions based on egocentric video, enabling semantic understanding and spatiotemporal reasoning in dynamic environments. Despite its promising potential, EVR encounters significant challenges stemming from the diversity of complex instructions and the intricate spatiotemporal dynamics in long-term egocentric videos. Prior solutions either employ Large Language Models (LLMs) over static video captions, which often omit critical visual details, or rely on end-to-end Vision-Language Models (VLMs) that struggle with stepwise compositional reasoning. Considering the complementary strengths of LLMs in reasoning and VLMs in perception, we propose CLiViS, a novel training-free framework that leverages LLMs for high-level task planning and orchestrates VLM-driven open-world visual perception to iteratively update the scene context. Building on this synergy, the core of CLiViS is a dynamic Cognitive Map that evolves throughout the reasoning process. This map constructs a structured representation of the embodied scene, bridging low-level perception and high-level reasoning. Extensive experiments across multiple benchmarks demonstrate the effectiveness and generality of CLiViS, especially in handling long-term visual dependencies. Code is available at https://github.com/Teacher-Tom/CLiViS.
中文: CLiViS是一种无需训练的框架,它结合了大型语言模型的任务规划能力和视觉语言模型的感知能力,通过动态认知地图连接感知与推理,以在复杂环境中实现高效的具身视觉推理。
English: CLiViS is a training-free framework that synergizes LLMs for task planning and VLMs for visual perception, using a dynamic Cognitive Map to bridge perception and reasoning for effective embodied visual reasoning in complex environments.
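The CLiViS loop alternates LLM planning with VLM perception around a shared scene representation. Below is a schematic version, with `llm_plan` and `vlm_perceive` as hypothetical stand-ins for the actual prompts, and the Cognitive Map simplified to a dict from entities to observed facts.

```python
def clivis_loop(instruction, video_segments, llm_plan, vlm_perceive, max_steps=8):
    """Training-free plan-perceive-update loop in the spirit of CLiViS."""
    cognitive_map = {}
    for _ in range(max_steps):
        # The LLM decides what to inspect next -- or answers -- given the map so far.
        action = llm_plan(instruction, cognitive_map)
        if action["type"] == "answer":
            return action["content"]
        # The VLM grounds the planner's query in the chosen video segment.
        observations = vlm_perceive(video_segments[action["segment"]], action["query"])
        for entity, fact in observations.items():
            cognitive_map.setdefault(entity, []).append(fact)
    # Out of budget: force a final answer from the accumulated map.
    return llm_plan(instruction, cognitive_map, force_answer=True)["content"]
```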

Authors:Jinhao Duan, James Diffenderfer, Sandeep Madireddy, Tianlong Chen, Bhavya Kailkhura, Kaidi Xu
Title: UProp: Investigating the Uncertainty Propagation of LLMs in Multi-Step Agentic Decision-Making
Abstract:
As Large Language Models (LLMs) are integrated into safety-critical applications involving sequential decision-making in the real world, it is essential to know when to trust LLM decisions. Existing LLM Uncertainty Quantification (UQ) methods are primarily designed for single-turn question-answering formats, leaving multi-step decision-making scenarios, e.g., LLM agentic systems, underexplored. In this paper, we introduce a principled, information-theoretic framework that decomposes LLM sequential decision uncertainty into two parts: (i) internal uncertainty intrinsic to the current decision, which is the focus of existing UQ methods, and (ii) extrinsic uncertainty, a Mutual-Information (MI) quantity describing how much uncertainty should be inherited from preceding decisions. We then propose UProp, an efficient and effective extrinsic uncertainty estimator that converts the direct estimation of MI to the estimation of Pointwise Mutual Information (PMI) over multiple Trajectory-Dependent Decision Processes (TDPs). UProp is evaluated over extensive multi-step decision-making benchmarks, e.g., AgentBench and HotpotQA, with state-of-the-art LLMs, e.g., GPT-4.1 and DeepSeek-V3. Experimental results demonstrate that UProp significantly outperforms existing single-turn UQ baselines equipped with thoughtful aggregation strategies. Moreover, we provide a comprehensive analysis of UProp, including sampling efficiency, potential applications, and intermediate uncertainty propagation, to demonstrate its effectiveness. Codes will be available at https://github.com/jinhaoduan/UProp.
中文: 本文提出UProp框架,将大语言模型序列决策的不确定性分解为内在和外在两部分,在多步决策基准测试中显著优于现有方法。
English: This paper introduces UProp, a novel framework that decomposes LLM sequential decision uncertainty into intrinsic and extrinsic components, significantly outperforming existing methods in multi-step decision-making benchmarks.

Authors:Jiale Zhang, Jiaxiang Chen, Zhucong Li, Jie Ding, Kui Zhao, Zenglin Xu, Xin Pang, Yinghui Xu
Title: SlimRAG: Retrieval without Graphs via Entity-Aware Context Selection
Abstract:
Retrieval-Augmented Generation (RAG) enhances language models by incorporating external knowledge at inference time. However, graph-based RAG systems often suffer from structural overhead and imprecise retrieval: they require costly pipelines for entity linking and relation extraction, yet frequently return subgraphs filled with loosely related or tangential content. This stems from a fundamental flaw -- semantic similarity does not imply semantic relevance. We introduce SlimRAG, a lightweight framework for retrieval without graphs. SlimRAG replaces structure-heavy components with a simple yet effective entity-aware mechanism. At indexing time, it constructs a compact entity-to-chunk table based on semantic embeddings. At query time, it identifies salient entities, retrieves and scores associated chunks, and assembles a concise, contextually relevant input -- without graph traversal or edge construction. To quantify retrieval efficiency, we propose Relative Index Token Utilization (RITU), a metric measuring the compactness of retrieved content. Experiments across multiple QA benchmarks show that SlimRAG outperforms strong flat and graph-based baselines in accuracy while reducing index size and RITU (e.g., 16.31 vs. 56+), highlighting the value of structure-free, entity-centric context selection. The code will be released soon. https://github.com/continue-ai-company/SlimRAG
中文:SlimRAG提出了一种轻量级、以实体为中心的框架,通过摒弃图结构提升了检索精度与效率,在多项基准测试中以更小的索引规模和更简洁的检索机制实现了优于现有方法的准确率。
English: SlimRAG introduces a lightweight, entity-centric framework that eliminates graph structures to enhance retrieval precision and efficiency, outperforming existing methods in accuracy while significantly reducing index size and retrieval complexity.
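The indexing and retrieval path is simple enough to show end to end. This sketch assumes hypothetical `extract_entities` (any NER or keyword extractor) and `score` (e.g., embedding cosine similarity) callables; the RITU computation reflects one plausible reading of "retrieved tokens relative to the index," and the paper's exact normalization may differ.

```python
from collections import defaultdict

def build_index(chunks, extract_entities):
    """Indexing time: a compact entity-to-chunk table, no graph edges."""
    table = defaultdict(set)
    for cid, chunk in enumerate(chunks):
        for entity in extract_entities(chunk):
            table[entity].add(cid)
    return table

def retrieve(query, chunks, table, extract_entities, score, k=5):
    """Query time: salient entities -> candidate chunks -> top-k by score,
    with no graph traversal or edge construction."""
    candidates = set()
    for entity in extract_entities(query):
        candidates |= table.get(entity, set())
    ranked = sorted(candidates, key=lambda c: score(query, chunks[c]), reverse=True)
    return [chunks[c] for c in ranked[:k]]

def ritu(retrieved, all_chunks, n_tokens=lambda t: len(t.split())):
    """One plausible reading of Relative Index Token Utilization:
    tokens handed to the generator relative to tokens indexed."""
    return sum(map(n_tokens, retrieved)) / max(1, sum(map(n_tokens, all_chunks)))
```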

Authors:Adithya Bhaskar, Alexander Wettig, Tianyu Gao, Yihe Dong, Danqi Chen
Title: Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?
Abstract:
Language models handle increasingly long contexts for tasks such as book summarization, but this leads to growing memory costs for the key-value (KV) cache. Many prior works have proposed ways of discarding KVs from memory, but their approaches are tailored to favorable settings, obscuring caveats like high peak memory and performance degradation, and a fair comparison between methods is difficult. In this paper, we propose the *KV footprint* as a unified metric, which accounts for both the amount of KV entries stored and their lifespan in memory. We evaluate methods based on the smallest footprint they attain while preserving performance in both long-context understanding and generation, with context lengths of up to 128K tokens. This metric reveals the high peak memory of prior KV eviction methods. One class of methods -- *post-fill eviction* -- has a high footprint due to being incompatible with eviction during pre-filling. We adapt these methods to be able to evict KVs during pre-filling, achieving substantially lower KV footprints. We then turn to *recency eviction* methods, wherein we propose PruLong, an end-to-end optimization method for learning which attention heads need to retain the full KV cache and which do not. PruLong saves memory while preserving long-context performance, achieving 12% smaller KV footprint than prior methods while retaining performance in challenging recall tasks. Our paper clarifies the complex tangle of long-context inference methods and paves the way for future development to minimize the KV footprint.
Chinese: 本文提出KV占用作为统一评估指标,揭示了现有键值缓存管理方法的局限性,并通过改进策略显著降低了内存使用,同时保持了长上下文任务中的性能表现。
English: This paper introduces the KV footprint as a unified metric to evaluate key-value cache management methods, revealing limitations in prior approaches and proposing adaptations that significantly reduce memory usage while maintaining performance in long-context tasks.
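The footprint idea can be made concrete with a toy computation: treat each KV entry as alive from creation to eviction and integrate over time, so high peak usage during pre-filling cannot hide behind a low steady state. Normalizing by the keep-everything baseline is our assumption about the exact convention.

```python
def kv_footprint(events, total_steps):
    """events: one (t_created, t_evicted) pair per KV entry."""
    kept = sum(min(t_ev, total_steps) - t_cr for t_cr, t_ev in events)
    full = sum(total_steps - t_cr for t_cr, _ in events)   # never-evict baseline
    return kept / max(1, full)

prefill, total, window = 100, 150, 16
# Post-fill eviction: every KV survives the whole pre-fill before any eviction.
post_fill = [(t, prefill) for t in range(prefill)]
# Recency eviction: each KV lives for a fixed window after creation.
recency = [(t, min(t + window, total)) for t in range(prefill)]
print(kv_footprint(post_fill, total))  # ~0.50: peak pre-fill memory dominates
print(kv_footprint(recency, total))    # ~0.16: far smaller footprint
```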

Authors:Shoubin Yu, Yue Zhang, Ziyang Wang, Jaehong Yoon, Mohit Bansal
Title: MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
Abstract:
Combining pre-trained expert models offers substantial potential for scalable multimodal reasoning, but building a unified framework remains challenging due to the increasing diversity of input modalities and task complexity. For instance, medical diagnosis requires precise reasoning over structured clinical tables, while financial forecasting depends on interpreting plot-based data to make informed predictions. To tackle this challenge, we introduce MEXA, a training-free framework that performs modality- and task-aware aggregation of multiple expert models to enable effective multimodal reasoning across diverse and distinct domains. MEXA dynamically selects expert models based on the input modality and the task-specific reasoning demands (i.e., skills). Each expert model, specialized in a modality-task pair, generates interpretable textual reasoning outputs. MEXA then aggregates and reasons over these outputs using a Large Reasoning Model (LRM) to produce the final answer. This modular design allows flexible and transparent multimodal reasoning across diverse domains without additional training overhead. We extensively evaluate our approach on diverse multimodal benchmarks, including Video Reasoning, Audio Reasoning, 3D Understanding, and Medical QA. MEXA consistently delivers performance improvements over strong multimodal baselines, highlighting the effectiveness and broad applicability of our expert-driven selection and aggregation in diverse multimodal reasoning tasks.
中文:MEXA是一种无需训练的框架,能根据输入模态和任务需求动态选择和聚合专业专家模型,无需额外训练即可实现跨领域的透明高效多模态推理。
English: MEXA is a training-free framework that dynamically selects and aggregates specialized expert models based on input modality and task demands, enabling transparent and effective multimodal reasoning across diverse domains without additional training.

Authors:Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li
Title: Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation
Abstract:
Large Language Models (LLMs) often exhibit \textit{hallucinations}, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM's internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://github.com/ECNU-Text-Computing/cot-hallu-detect .
中文:思维链提示方法能减少大语言模型的幻觉,但会掩盖检测所需的关键信号,从而削弱各种检测方法的有效性,揭示了推理与检测之间的权衡。
English: Chain-of-Thought prompting reduces hallucinations in Large Language Models but impairs detection methods by obscuring critical signals, revealing a trade-off between reasoning and detection effectiveness.

Authors:Sahil Kale, Vijaykant Nadadur
Title: TeXpert: A Multi-Level Benchmark for Evaluating LaTeX Code Generation by LLMs
Abstract:
LaTeX's precision and flexibility in typesetting have made it the gold standard for the preparation of scientific documentation. Large Language Models (LLMs) present a promising opportunity for researchers to produce publication-ready material using LaTeX with natural language instructions, yet current benchmarks completely lack evaluation of this ability. By introducing TeXpert, our benchmark dataset with natural language prompts for generating LaTeX code focused on components of scientific documents across multiple difficulty levels, we conduct an in-depth analysis of LLM performance in this regard and identify frequent error types. Our evaluation across open and closed-source LLMs highlights multiple key findings: LLMs excelling on standard benchmarks perform poorly in LaTeX generation with a significant accuracy drop-off as the complexity of tasks increases; open-source models like DeepSeek v3 and DeepSeek Coder strongly rival closed-source counterparts in LaTeX tasks; and formatting and package errors are unexpectedly prevalent, suggesting a lack of diverse LaTeX examples in the training datasets of most LLMs. Our dataset, code, and model evaluations are available at https://github.com/knowledge-verse-ai/TeXpert.
中文: LaTeX的精确性使其成为科学文档的理想选择,而TeXpert基准测试表明,尽管大型语言模型在生成复杂LaTeX代码时准确性下降,但像DeepSeek这样的开源模型与闭源模型表现相当,同时揭示了因训练数据不足导致的常见格式错误。
English: LaTeX's precision makes it ideal for scientific documents, and the TeXpert benchmark reveals that while LLMs struggle with generating accurate LaTeX code as complexity increases, open-source models like DeepSeek compete closely with closed-source ones, exposing common formatting errors due to limited training data.
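A first-pass functional check for this kind of benchmark is simply whether the generated code compiles. The sketch below wraps a snippet in a minimal document and invokes pdflatex (which must be installed); TeXpert's actual grading is finer-grained, so this only catches hard compile errors such as the formatting and package mistakes the abstract mentions.

```python
import pathlib
import subprocess
import tempfile

def compiles(latex_body: str, timeout: int = 30) -> bool:
    """Return True iff a minimal document containing the snippet compiles."""
    doc = ("\\documentclass{article}\n\\begin{document}\n"
           + latex_body + "\n\\end{document}\n")
    with tempfile.TemporaryDirectory() as tmp:
        (pathlib.Path(tmp) / "snippet.tex").write_text(doc)
        proc = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "snippet.tex"],
            cwd=tmp, capture_output=True, timeout=timeout,
        )
        return proc.returncode == 0

print(compiles(r"$e^{i\pi} + 1 = 0$"))  # True where pdflatex is available
```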

Authors:Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao
Title: IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
Abstract:
Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent's actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent's interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released at https://github.com/AI45Lab/IS-Bench.
中文:IS-Bench作为首个评估具身智能体交互安全性的多模态基准,通过对主流视觉语言模型的广泛测试,揭示了当前模型缺乏风险意识以及安全导向推理存在的任务完成度妥协问题。
English: IS-Bench is introduced as the first multimodal benchmark for evaluating interactive safety in embodied agents, revealing current models' lack of risk awareness and the trade-offs of safety-focused reasoning through extensive testing on leading VLMs.
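The process-oriented evaluation reduces to an ordering check over the agent's action trajectory. A simplified scorer is shown below; representing each requirement as a (mitigation, risk_step, order) triple and the uniform averaging are our simplifications of the benchmark's grading.

```python
def interactive_safety_score(trajectory, requirements):
    """Fraction of safety requirements satisfied by an action trajectory.
    Each requirement demands a mitigation 'before' or 'after' a risk-prone
    step; a missing mitigation or risk step counts as a failure."""
    position = {action: i for i, action in enumerate(trajectory)}
    passed = 0
    for mitigation, risk_step, order in requirements:
        if mitigation not in position or risk_step not in position:
            continue
        if order == "before":
            passed += position[mitigation] < position[risk_step]
        else:
            passed += position[mitigation] > position[risk_step]
    return passed / len(requirements)

traj = ["turn_off_stove", "pick_up_pan", "wash_pan"]
reqs = [("turn_off_stove", "pick_up_pan", "before")]
print(interactive_safety_score(traj, reqs))  # 1.0
```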

Authors:Yao Lu, Zhaiyuan Ji, Jiawei Du, Yu Shanqing, Qi Xuan, Tianyi Zhou
Title: From LLM-anation to LLM-orchestrator: Coordinating Small Models for Data Labeling
Abstract:
Although the annotation paradigm based on Large Language Models (LLMs) has made significant breakthroughs in recent years, its actual deployment still has two core bottlenecks: first, calling commercial APIs for large-scale annotation is very expensive; second, in scenarios that require fine-grained semantic understanding, such as sentiment classification and toxicity classification, the annotation accuracy of LLMs is even lower than that of Small Language Models (SLMs) dedicated to this field. To address these problems, we propose a new paradigm of multi-model cooperative annotation and design a fully automatic annotation framework AutoAnnotator based on this. Specifically, AutoAnnotator consists of two layers. The upper-level meta-controller layer uses the generation and reasoning capabilities of LLMs to select SLMs for annotation, automatically generate annotation code and verify difficult samples; the lower-level task-specialist layer consists of multiple SLMs that perform annotation through multi-model voting. In addition, we use the difficult samples obtained by the secondary review of the meta-controller layer as the reinforcement learning set and fine-tune the SLMs in stages through a continual learning strategy, thereby improving the generalization of SLMs. Extensive experiments show that AutoAnnotator outperforms existing open-source/API LLMs in zero-shot, one-shot, CoT, and majority voting settings. Notably, AutoAnnotator reduces the annotation cost by 74.15% compared to directly annotating with GPT-3.5-turbo, while still improving the accuracy by 6.21%. Project page: https://github.com/Zhaiyuan-Ji/AutoAnnotator.
中文: AutoAnnotator框架通过LLM元控制器选择专用SLM并进行投票标注的双层架构,解决了LLM在细粒度标注中成本高、精度低的问题,相比GPT-3.5-turbo实现标注成本降低74.15%且准确率提升6.21%。
English: The AutoAnnotator framework addresses the high cost and low accuracy of LLMs in fine-grained annotation by using a two-layer system where an LLM meta-controller selects and codes for specialized SLMs that perform voting-based annotation, achieving a 74.15% cost reduction and 6.21% accuracy improvement over GPT-3.5-turbo.
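The two-layer division of labor is easy to express: SLMs vote, and only low-agreement samples escalate to the LLM, which is where the cost savings come from. In this sketch, `slms` are callables returning labels, `llm_review` is the meta-controller's secondary review, and the 0.8 agreement threshold is an illustrative choice rather than the paper's.

```python
from collections import Counter

def annotate(samples, slms, llm_review, agree_ratio=0.8):
    """Label samples by SLM majority vote; escalate difficult ones."""
    labels, hard_samples = [], []
    for x in samples:
        votes = Counter(slm(x) for slm in slms)
        label, count = votes.most_common(1)[0]
        if count / len(slms) >= agree_ratio:
            labels.append(label)                 # cheap consensus path
        else:
            labels.append(llm_review(x, votes))  # secondary review by the LLM
            hard_samples.append(x)               # reinforcement set for SLM fine-tuning
    return labels, hard_samples
```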

Authors:Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Junhao Cheng, Ying Shan, Xihui Liu
Title: GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning
Abstract:
Recent reinforcement learning approaches, such as outcome-supervised GRPO, have advanced Chain-of-Thought reasoning in large language models (LLMs), yet their adaptation to multimodal LLMs (MLLMs) is unexplored. To address the lack of rigorous evaluation for MLLM post-training methods, we introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. It offers a large training set and evaluates generalization across three escalating challenges: in-distribution, cross-environment, and cross-environment-task scenarios. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. This stems from reward signals focusing solely on final answers, encouraging shortcuts, and strict KL penalties limiting exploration. To address this, we propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision. GRPO-CARE introduces a two-tiered reward: (1) a base reward for answer correctness, and (2) an adaptive consistency bonus, computed by comparing the model's reasoning-to-answer likelihood (via a slowly-evolving reference model) against group peers. This dual mechanism amplifies rewards for reasoning paths that are both correct and logically consistent. Replacing KL penalties with this adaptive bonus, GRPO-CARE outperforms standard GRPO on SEED-Bench-R1, achieving a 6.7% performance gain on the hardest evaluation level and a 24.5% improvement in consistency. It also shows strong transferability, improving model performance across diverse video understanding benchmarks. Our work contributes a systematically designed benchmark and a generalizable post-training framework, advancing the development of more interpretable and robust MLLMs.
中文: 本研究提出了用于评估多模态大语言模型后训练方法的基准SEED-Bench-R1,并开发了GRPO-CARE一致性感知强化学习框架,该框架在无需显式监督的情况下显著提升了答案准确性和推理连贯性,实现了明显的性能改进。
English: This study introduces SEED-Bench-R1, a benchmark for evaluating multimodal large language models' post-training methods, and proposes GRPO-CARE, a consistency-aware reinforcement learning framework that enhances both answer accuracy and reasoning coherence, achieving significant performance gains without explicit supervision.
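The two-tiered reward can be sketched per sampled group. Here the consistency bonus goes only to correct samples whose reference-model reasoning-to-answer likelihood beats the group mean, which is our simplified reading of the adaptive, peer-relative bonus; the 0.5 scale is illustrative.

```python
import numpy as np

def care_rewards(correct, ref_consistency, bonus_scale=0.5):
    """Base correctness reward plus an adaptive consistency bonus.
    correct: 0/1 per sample; ref_consistency: reference-model likelihood
    that each sample's reasoning leads to its answer."""
    correct = np.asarray(correct, dtype=float)
    consistency = np.asarray(ref_consistency, dtype=float)
    bonus = bonus_scale * correct * (consistency > consistency.mean())
    return correct + bonus

print(care_rewards([1, 1, 0, 1], [0.9, 0.4, 0.8, 0.7]))  # [1.5 1.  0.  1. ]
```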

Authors:Tianle Gu, Kexin Huang, Zongqi Wang, Yixu Wang, Jie Li, Yuanqi Yao, Yang Yao, Yujiu Yang, Yan Teng, Yingchun Wang
Title: Probing the Robustness of Large Language Models Safety to Latent Perturbations
Abstract:
Safety alignment is a key requirement for building reliable Artificial General Intelligence. Despite significant advances in safety alignment, we observe that minor latent shifts can still trigger unsafe responses in aligned models. We argue that this stems from the shallow nature of existing alignment methods, which focus on surface-level refusal behaviors without sufficiently altering internal representations. Consequently, small shifts in hidden activations can re-trigger harmful behaviors embedded in the latent space. To explore the robustness of safety alignment to latent perturbations, we introduce a probing method that measures the Negative Log-Likelihood of the original response generated by the model. This probe quantifies local sensitivity in the latent space, serving as a diagnostic tool for identifying vulnerable directions. Based on this signal, we construct effective jailbreak trajectories, giving rise to the Activation Steering Attack (ASA). More importantly, these insights offer a principled foundation for improving alignment robustness. To this end, we introduce Layer-wise Adversarial Patch Training (LAPT), a fine-tuning strategy that injects controlled perturbations into hidden representations during training. Experimental results highlight that LAPT strengthens alignment robustness without compromising general capabilities. Our findings reveal fundamental flaws in current alignment paradigms and call for representation-level training strategies that move beyond surface-level behavior supervision. Codes and results are available at https://github.com/Carol-gutianle/LatentSafety.
中文摘要:现有AI模型的安全对齐方法过于浅层,易受潜在偏移影响而引发不安全响应,但提出的分层对抗补丁训练(LAPT)通过针对性优化内部表征,有效提升了安全鲁棒性。
English Summary: Current safety alignment methods for AI models are shallow, making them vulnerable to latent shifts that can trigger unsafe responses, but the proposed Layer-wise Adversarial Patch Training (LAPT) enhances robustness by addressing these internal weaknesses.

Authors:Markus Frohmann, Gabriel Meseguer-Brocal, Markus Schedl, Elena V. Epure
Title: Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion
Abstract:
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
中文: 提出的DE-detect方法采用多模态方案,结合从音频中提取的转录歌词和语音特征,能有效识别AI生成的音乐,对音频干扰具有更强鲁棒性,在实际应用中优于现有检测器。
English: The proposed DE-detect method uses a multimodal approach combining transcribed sung lyrics and speech features from audio to effectively identify AI-generated music, offering greater robustness against audio perturbations and outperforming existing detectors in real-world applications.

Authors:Myke C. Cohen, Zhe Su, Hsien-Te Kao, Daniel Nguyen, Spencer Lynch, Maarten Sap, Svitlana Volkova
Title: Exploring Big Five Personality and AI Capability Effects in LLM-Simulated Negotiation Dialogues
Abstract:
This paper presents an evaluation framework for agentic AI systems in mission-critical negotiation contexts, addressing the need for AI agents that can adapt to diverse human operators and stakeholders. Using Sotopia as a simulation testbed, we present two experiments that systematically evaluated how personality traits and AI agent characteristics influence LLM-simulated social negotiation outcomes--a capability essential for a variety of applications involving cross-team coordination and civil-military interactions. Experiment 1 employs causal discovery methods to measure how personality traits impact price bargaining negotiations, finding that Agreeableness and Extraversion significantly affect believability, goal achievement, and knowledge acquisition outcomes. Sociocognitive lexical measures extracted from team communications detected fine-grained differences in agents' empathic communication, moral foundations, and opinion patterns, providing actionable insights for agentic AI systems that must operate reliably in high-stakes operational scenarios. Experiment 2 evaluates human-AI job negotiations by manipulating both simulated human personality and AI system characteristics, specifically transparency, competence, and adaptability, demonstrating how AI agent trustworthiness impacts mission effectiveness. These findings establish a repeatable evaluation methodology for experimenting with AI agent reliability across diverse operator personalities and human-agent team dynamics, directly supporting operational requirements for reliable AI systems. Our work advances the evaluation of agentic AI workflows by moving beyond standard performance metrics to incorporate social dynamics essential for mission success in complex operations.

Authors:Sheng Liu, Tianlang Chen, Pan Lu, Haotian Ye, Yizheng Chen, Lei Xing, James Zou
Title: Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute
Abstract:
Test-time compute has emerged as a powerful paradigm for improving the performance of large language models (LLMs), where generating multiple outputs or refining individual chains can significantly boost answer accuracy. However, existing methods like Best-of-N, majority voting, and self-reflection typically apply reasoning in a uniform way across inputs, overlooking the fact that different problems may require different levels of reasoning depth. In this work, we propose Fractional Reasoning, a training-free and model-agnostic framework that enables continuous control over reasoning intensity at inference time, going beyond the limitations of fixed instructional prompts. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor, allowing the model to tailor its reasoning process to the complexity of each input. This supports two key modes of test-time scaling: (1) improving output quality in breadth-based strategies (e.g., Best-of-N, majority voting), and (2) enhancing the correctness of individual reasoning chains in depth-based strategies (e.g., self-reflection). Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
中文:Fractional Reasoning是一种无需训练的框架,通过在推理时缩放潜在引导向量来实现对推理强度的连续控制,从而在多种推理任务中同步提升广度策略和深度策略的测试时计算效果。
English: Fractional Reasoning is a training-free framework that enables continuous control over reasoning intensity during inference by scaling latent steering vectors, improving both breadth-based and depth-based test-time compute strategies across various reasoning tasks.
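Mechanically, the method amounts to re-applying a pre-extracted "deeper reasoning" direction with a continuous strength. The sketch assumes the steering vector was already extracted elsewhere (e.g., as a difference of mean activations with and without a reasoning prompt); alpha = 0 recovers the unmodified model.

```python
import torch

def steer_hidden(hidden, steering_vec, alpha):
    """Add a reasoning-direction vector to hidden states with tunable strength."""
    return hidden + alpha * steering_vec

h = torch.randn(4, 16)      # (tokens, d_model) activations from some layer
v = torch.randn(16)         # assumed pre-extracted steering direction
for alpha in (0.0, 0.5, 1.0, 2.0):
    print(alpha, steer_hidden(h, v, alpha).norm().item())
```

In practice the vector would be applied inside a forward hook at a chosen layer during generation; the toy tensors above only show the arithmetic.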

Authors:Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
Title: GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Abstract:
Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V. However, deploying these models in real-world scenarios, particularly on resource-constrained devices, remains challenging due to their substantial computational demands. This has spurred interest in distilling knowledge from large VLMs into smaller, more efficient counterparts. A key challenge arises here from the diversity of VLM architectures, which are built on different LLMs and employ varying token types, differing in vocabulary size, token splits, and token index ordering. To address this limitation of being tied to a specific VLM type, we present Generation after Recalibration (GenRecal), a novel, general-purpose distillation framework for VLMs. GenRecal incorporates a Recalibrator that aligns and adapts feature representations between heterogeneous VLMs, enabling effective knowledge transfer across different types of VLMs. Through extensive experiments on multiple challenging benchmarks, we demonstrate that GenRecal significantly improves baseline performances, eventually outperforming large-scale open- and closed-source VLMs.

Authors:Yining Hong, Rui Sun, Bingxuan Li, Xingcheng Yao, Maxine Wu, Alexander Chien, Da Yin, Ying Nian Wu, Zhecan James Wang, Kai-Wei Chang
Title: Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence
Abstract:
AI agents today are mostly siloed - they either retrieve and reason over vast amounts of digital information and knowledge obtained online; or interact with the physical world through embodied perception, planning and action - but rarely both. This separation limits their ability to solve tasks that require integrated physical and digital intelligence, such as cooking from online recipes, navigating with dynamic map data, or interpreting real-world landmarks using web knowledge. We introduce Embodied Web Agents, a novel paradigm for AI agents that fluidly bridge embodiment and web-scale reasoning. To operationalize this concept, we first develop the Embodied Web Agents task environments, a unified simulation platform that tightly integrates realistic 3D indoor and outdoor environments with functional web interfaces. Building upon this platform, we construct and release the Embodied Web Agents Benchmark, which encompasses a diverse suite of tasks including cooking, navigation, shopping, tourism, and geolocation - all requiring coordinated reasoning across physical and digital realms for systematic assessment of cross-domain intelligence. Experimental results reveal significant performance gaps between state-of-the-art AI systems and human capabilities, establishing both challenges and opportunities at the intersection of embodied cognition and web-scale knowledge access. All datasets, codes and websites are publicly available at our project page https://embodied-web-agent.github.io/.

Authors:Tevin Wang, Chenyan Xiong
Title: AutoRule: Reasoning Chain-of-thought Extracted Rule-based Rewards Improve Preference Learning
Abstract:
Rule-based rewards offer a promising strategy for improving reinforcement learning from human feedback (RLHF), but current approaches often rely on manual rule engineering. We present AutoRule, a fully automated method for extracting rules from preference feedback and formulating them into rule-based rewards. AutoRule extraction operates in three stages: it leverages a reasoning model to interpret user preferences, identifies candidate rules from the reasoning chain of these interpretations, and synthesizes them into a unified rule set. Leveraging the finalized rule set, we employ language-model verifiers to compute the fraction of rules satisfied by each output, using this metric as an auxiliary reward alongside the learned reward model during policy optimization. Training a Llama-3-8B model with AutoRule results in a 28.6\% relative improvement in length-controlled win rate on AlpacaEval2.0, and a 6.1\% relative gain in second-turn performance on a held-out MT-Bench subset, compared to a GRPO baseline trained with the same learned reward model but without the rule-based auxiliary reward. Our analysis confirms that the extracted rules exhibit good agreement with dataset preference. We find that AutoRule demonstrates reduced reward hacking compared to a learned reward model when run over two episodes. Finally, our case study suggests that the extracted rules capture unique qualities valued in different datasets. The extracted rules are provided in the appendix, and the code is open-sourced at https://github.com/cxcscmu/AutoRule.
Chinese: AutoRule通过从偏好反馈中自动提取规则并构建规则奖励,结合学习到的奖励模型强化训练,显著提升了模型在AlpacaEval2.0和MT-Bench等基准测试中的性能表现。
English: AutoRule automates the extraction of rules from preference feedback to create rule-based rewards, enhancing reinforcement learning by integrating these with learned reward models and significantly improving model performance on benchmarks like AlpacaEval2.0 and MT-Bench.
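The auxiliary reward itself is a one-liner: the fraction of extracted rules an output satisfies, as judged by a verifier. Here `verify` stands in for the paper's language-model verifier, and the mixing weight with the learned reward model is an illustrative choice.

```python
def rule_reward(output, rules, verify):
    """Fraction of rules the output satisfies."""
    return sum(verify(rule, output) for rule in rules) / len(rules)

def total_reward(output, rules, verify, learned_rm, lam=0.3):
    """Learned reward plus the rule-based auxiliary term."""
    return learned_rm(output) + lam * rule_reward(output, rules, verify)

rules = ["answers the question directly", "cites a source"]
toy_verify = lambda rule, out: rule.split()[-1] in out   # toy verifier
print(rule_reward("this answer cites a source", rules, toy_verify))  # 0.5
```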

Authors:Shaoqing Lin, Chong Teng, Fei Li, Donghong Ji, Lizhen Qu, Zhuang Li
Title: DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement
Abstract:
Vision-Language Models (VLMs) generate discourse-level, multi-sentence visual descriptions, challenging text scene graph parsers built for single-sentence caption-to-graph mapping. Current approaches typically merge sentence-level parsing outputs for discourse input, often missing phenomena like cross-sentence coreference, resulting in fragmented graphs and degraded downstream VLM task performance. We introduce a new task, Discourse-level text Scene Graph parsing (DiscoSG), and release DiscoSG-DS, a dataset of 400 expert-annotated and 8,430 synthesised multi-sentence caption-graph pairs. Each caption averages 9 sentences, and each graph contains at least 3 times more triples than those in existing datasets. Fine-tuning GPT-4o on DiscoSG-DS yields over 40% higher SPICE than the strongest sentence-merging baseline. However, its high inference cost and licensing restrict open-source use, and smaller fine-tuned open-source models (e.g., Flan-T5) perform poorly on dense graph generation. To bridge this gap, we propose DiscoSG-Refiner, which drafts a base graph using a seed parser and iteratively refines it with a second model, improving robustness for complex graph generation. Using two small fine-tuned Flan-T5-Base models, DiscoSG-Refiner improves SPICE by approximately 30% over the baseline while achieving 86 times faster inference than GPT-4o. It also delivers consistent gains on downstream VLM tasks, including discourse-level caption evaluation and hallucination detection, outperforming alternative parsers. Code and data are available at https://github.com/ShaoqLin/DiscoSG .
中文: 为解决视觉语言模型中多句子描述解析的不足,研究提出了话语级场景图解析任务DiscoSG及相应数据集,并开发了DiscoSG-Refiner方法,该方法通过迭代优化显著提升了解析性能与效率,同时大幅优于现有基线模型。
English: Vision-Language Models require discourse-level scene graph parsing to overcome the limitations of sentence-merging approaches, leading to the introduction of DiscoSG task and dataset, and the development of DiscoSG-Refiner, which significantly enhances parsing performance and efficiency for downstream tasks.

Authors:Zhouhong Gu, Xiaoxuan Zhu, Yin Cai, Hao Shen, Xingzhou Chen, Qingyi Wang, Jialin Li, Xiaoran Shi, Haoran Guo, Wenxuan Huang, Hongwei Feng, Yanghua Xiao, Zheyu Ye, Yao Hu, Shaosheng Cao
Title: AgentGroupChat-V2: Divide-and-Conquer Is What LLM-Based Multi-Agent System Need
Abstract:
Large language model based multi-agent systems have demonstrated significant potential in social simulation and complex task resolution domains. However, current frameworks face critical challenges in system architecture design, cross-domain generalizability, and performance guarantees, particularly as task complexity and the number of agents increase. We introduce AgentGroupChat-V2, a novel framework addressing these challenges through three core innovations: (1) a divide-and-conquer fully parallel architecture that decomposes user queries into hierarchical task forest structures enabling dependency management and distributed concurrent processing. (2) an adaptive collaboration engine that dynamically selects heterogeneous LLM combinations and interaction modes based on task characteristics. (3) agent organization optimization strategies combining divide-and-conquer approaches for efficient problem decomposition. Extensive experiments demonstrate AgentGroupChat-V2's superior performance across diverse domains, achieving 91.50% accuracy on GSM8K (exceeding the best baseline by 5.6 percentage points), 30.4% accuracy on competition-level AIME (nearly doubling other methods), and 79.20% pass@1 on HumanEval. Performance advantages become increasingly pronounced with higher task difficulty, particularly on Level 5 MATH problems where improvements exceed 11 percentage points compared to state-of-the-art baselines. These results confirm that AgentGroupChat-V2 provides a comprehensive solution for building efficient, general-purpose LLM multi-agent systems with significant advantages in complex reasoning scenarios. Code is available at https://github.com/MikeGu721/AgentGroupChat-V2.
Chinese: AgentGroupChat-V2提出了一种具有并行架构和自适应协作引擎的新型多智能体框架,在多个基准测试的复杂推理任务中展现出卓越性能。
English: AgentGroupChat-V2 introduces a novel multi-agent framework with a parallel architecture and adaptive collaboration engine, demonstrating superior performance in complex reasoning tasks across multiple benchmarks.
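The divide-and-conquer scheduler is the most code-like part of the framework. This sketch runs every dependency-satisfied task concurrently, level by level; `solve` stands in for dispatching a task to an agent group, and the scheduling details are our own simplification of the task-forest execution.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task_forest(tasks, deps, solve):
    """Execute tasks concurrently once their dependencies are resolved.
    deps[t] lists tasks that must finish before t may start."""
    results, pending = {}, set(tasks)
    with ThreadPoolExecutor() as pool:
        while pending:
            ready = [t for t in pending if all(d in results for d in deps.get(t, []))]
            if not ready:
                raise ValueError("cyclic or unsatisfiable dependencies")
            futures = {t: pool.submit(solve, t, dict(results)) for t in ready}
            for t, fut in futures.items():
                results[t] = fut.result()
            pending -= set(ready)
    return results

deps = {"merge": ["sub1", "sub2"]}       # two subtasks solved in parallel, then merged
out = run_task_forest(["sub1", "sub2", "merge"], deps, lambda t, r: f"done:{t}")
print(out["merge"])  # done:merge
```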

Authors:Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Title: video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Abstract:
Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2's captioning accuracy, reducing the captioning error rates by 28\%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at https://github.com/bytedance/video-SALMONN-2.
Chinese: Video-SALMONN 2 通过多轮直接偏好优化(MrDPO)和字幕质量目标,在视频描述和问答任务中实现了最先进的性能,超越了GPT-4o和Gemini-1.5 Pro等专有系统,并在多个基准测试中表现优异。
English: Video-SALMONN 2 introduces multi-round direct preference optimization (MrDPO) with a caption-quality objective, achieving state-of-the-art results in video description and question answering across multiple benchmarks while outperforming proprietary systems like GPT-4o and Gemini-1.5 Pro.

Authors:Junke Wang, Hongshun Ling, Li Zhang, Longqian Zhang, Fang Wang, Yuan Gao, Zhi Li
Title: CKD-EHR: Clinical Knowledge Distillation for Electronic Health Records
Abstract:
Electronic Health Records (EHR)-based disease prediction models have demonstrated significant clinical value in promoting precision medicine and enabling early intervention. However, existing large language models face two major challenges: insufficient representation of medical knowledge and low efficiency in clinical deployment. To address these challenges, this study proposes the CKD-EHR (Clinical Knowledge Distillation for EHR) framework, which achieves efficient and accurate disease risk prediction through knowledge distillation techniques. Specifically, the large language model Qwen2.5-7B is first fine-tuned on medical knowledge-enhanced data to serve as the teacher model. It then generates interpretable soft labels through a multi-granularity attention distillation mechanism. Finally, the distilled knowledge is transferred to a lightweight BERT student model. Experimental results show that on the MIMIC-III dataset, CKD-EHR significantly outperforms the baseline model: diagnostic accuracy is increased by 9%, F1-score is improved by 27%, and a 22.2 times inference speedup is achieved. This innovative solution not only greatly improves resource utilization efficiency but also significantly enhances the accuracy and timeliness of diagnosis, providing a practical technical approach for resource optimization in clinical settings. The code and data for this research are available at https://github.com/209506702/CKD_EHR.
Chinese: 本研究提出CKD-EHR框架,通过知识蒸馏技术将经过医学知识增强的大型语言模型作为教师模型,将其知识迁移至轻量级BERT学生模型,在MIMIC-III数据集上显著提升了诊断准确率、F1分数和推理速度。
English: This study introduces the CKD-EHR framework, which uses knowledge distillation to enhance disease prediction by fine-tuning a large language model as a teacher and transferring its knowledge to a lightweight BERT model, achieving significant improvements in accuracy, F1-score, and inference speed on the MIMIC-III dataset.
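The transfer step rests on standard soft-label distillation, which is worth seeing concretely: a KL term against the teacher's temperature-softened distribution plus a hard-label cross-entropy. The multi-granularity attention transfer from the paper is omitted; T and alpha are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KD (scaled by T^2, as is conventional) plus hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student = torch.randn(8, 5, requires_grad=True)   # e.g., BERT student logits
teacher = torch.randn(8, 5)                       # e.g., fine-tuned teacher logits
labels = torch.randint(0, 5, (8,))
print(distill_loss(student, teacher, labels).item())
```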

Authors:Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber
Title: Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
Abstract:
Evaluating open-ended long-form generation is challenging because it is hard to define what clearly separates good from bad outputs. Existing methods often miss key aspects like coherence, style, or relevance, or are biased by pretraining data, making open-ended long-form evaluation an underexplored problem. To address this gap, we propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO and guiding its training with distinct rewards for good and bad outputs. Trained on two response evaluation datasets with diverse long-form styles and Likert-rated quality, PrefBERT effectively supports GRPO by offering better semantic reward feedback than traditional metrics ROUGE-L and BERTScore do. Through comprehensive evaluations, including LLM-as-a-judge, human ratings, and qualitative analysis, we show that PrefBERT, trained on multi-sentence and paragraph-length responses, remains reliable across varied long passages and aligns well with the verifiable rewards GRPO needs. Human evaluations confirm that using PrefBERT as the reward signal to train policy models yields responses better aligned with human preferences than those trained with traditional metrics. Our code is available at https://github.com/zli12321/long_form_rl.
Chinese: PrefBERT是一种新颖的评分模型,旨在通过提供语义奖励反馈来评估开放式生成长文本,其表现优于ROUGE-L和BERTScore等传统指标,并在训练策略模型时更好地符合人类偏好。
English: PrefBERT is a novel scoring model designed to evaluate open-ended long-form generation by providing semantic reward feedback, outperforming traditional metrics like ROUGE-L and BERTScore and aligning better with human preferences in training policy models.

Authors:Zhoujun Cheng, Shibo Hao, Tianyang Liu, Fan Zhou, Yutao Xie, Feng Yao, Yuexin Bian, Yonghao Zhuang, Nilabjo Dey, Yuheng Zha, Yi Gu, Kun Zhou, Yuqi Wang, Yuan Li, Richard Fan, Jianshu She, Chengqian Gao, Abulhair Saparov, Haonan Li, Taylor W. Killian, Mikhail Yurochkin, Zhengzhong Liu, Eric P. Xing, Zhiting Hu
Title: Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective
Abstract:
Reinforcement learning (RL) has emerged as a promising approach to improve large language model (LLM) reasoning, yet most open efforts focus narrowly on math and code, limiting our understanding of its broader applicability to general reasoning. A key challenge lies in the lack of reliable, scalable RL reward signals across diverse reasoning domains. We introduce Guru, a curated RL reasoning corpus of 92K verifiable examples spanning six reasoning domains--Math, Code, Science, Logic, Simulation, and Tabular--each built through domain-specific reward design, deduplication, and filtering to ensure reliability and effectiveness for RL training. Based on Guru, we systematically revisit established findings in RL for LLM reasoning and observe significant variation across domains. For example, while prior work suggests that RL primarily elicits existing knowledge from pretrained models, our results reveal a more nuanced pattern: domains frequently seen during pretraining (Math, Code, Science) easily benefit from cross-domain RL training, while domains with limited pretraining exposure (Logic, Simulation, and Tabular) require in-domain training to achieve meaningful performance gains, suggesting that RL is likely to facilitate genuine skill acquisition. Finally, we present Guru-7B and Guru-32B, two models that achieve state-of-the-art performance among open models RL-trained with publicly available data, outperforming best baselines by 7.9% and 6.7% on our 17-task evaluation suite across six reasoning domains. We also show that our models effectively improve the Pass@k performance of their base models, particularly on complex tasks less likely to appear in pretraining data. We release data, models, training and evaluation code to facilitate general-purpose reasoning at: https://github.com/LLM360/Reasoning360
中文: Guru语料库通过涵盖六个领域的9.2万个多样化推理实例,解决了强化学习在语言模型中应用范围有限的问题,其Guru-7B和Guru-32B模型实现了最先进性能,证明强化学习既能激发既有知识,也能在预训练不足的领域促成真正的技能习得。
English: The Guru corpus introduces 92K diverse reasoning examples across six domains to address the limited scope of reinforcement learning in language models, enabling models like Guru-7B and Guru-32B to achieve state-of-the-art performance by demonstrating that RL can both elicit existing knowledge and foster genuine skill acquisition in underrepresented areas.

Authors:Li-Wei Chen, Takuya Higuchi, Zakaria Aldeneh, Ahmed Hussen Abdelaziz, Alexander Rudnicky
Title: A Variational Framework for Improving Naturalness in Generative Spoken Language Models
Abstract:
The success of large language models in text processing has inspired their adaptation to speech modeling. However, since speech is continuous and complex, it is often discretized for autoregressive modeling. Speech tokens derived from self-supervised models (known as semantic tokens) typically focus on the linguistic aspects of speech but neglect prosodic information. As a result, models trained on these tokens can generate speech with reduced naturalness. Existing approaches try to fix this by adding pitch features to the semantic tokens. However, pitch alone cannot fully represent the range of paralinguistic attributes, and selecting the right features requires careful hand-engineering. To overcome this, we propose an end-to-end variational approach that automatically learns to encode these continuous speech attributes to enhance the semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. Moreover, it produces preferred speech continuations according to human raters. Code, samples and models are available at https://github.com/b04901014/vae-gslm.
Chinese Summary: 本文提出了一种端到端的变分方法,能自动学习将连续语音属性编码到语义标记中,无需手动特征工程,从而提高了生成语音的自然度。
English Summary: This paper introduces an end-to-end variational method that automatically learns to encode continuous speech attributes into semantic tokens, eliminating manual feature engineering and improving speech naturalness in generated outputs.

Authors:Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou
Title: Optimizing Length Compression in Large Reasoning Models
Abstract:
Large Reasoning Models (LRMs) have achieved remarkable success, yet they often suffer from producing unnecessary and verbose reasoning chains. We identify a core aspect of this issue as "invalid thinking" -- models tend to repeatedly double-check their work after having derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency to propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant reduction in sequence length (~50%) with only a marginal (~2%) drop in accuracy, achieving a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
中文: LC-R1采用基于群组相对策略优化的后训练方法,通过简洁性和充分性原则结合长度与压缩奖励,在推理链长度减少约50%的同时仅造成约2%的准确率下降。
English: LC-R1, a post-training method using Group Relative Policy Optimization, significantly reduces reasoning chain length by 50% with minimal accuracy loss by applying Brevity and Sufficiency principles through Length and Compress Rewards.
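The two reward signals can be illustrated on a thinking trace where the position of the first correct answer derivation is known. The functional forms below are ours, not the paper's: a length term favoring short sequences, and a compress term penalizing tokens emitted after the answer is already found (the "invalid thinking").

```python
def lc_rewards(tokens, first_answer_pos, max_len=512):
    """Return (length_reward, compress_reward) for one thinking trace."""
    n = len(tokens)
    length_reward = 1.0 - n / max_len                 # shorter is better overall
    first = first_answer_pos if first_answer_pos is not None else n
    invalid_fraction = (n - first) / max(1, n)        # double-checking after the answer
    compress_reward = 1.0 - invalid_fraction
    return length_reward, compress_reward

trace = ["tok"] * 200
print(lc_rewards(trace, first_answer_pos=120))  # (0.609..., 0.6)
```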

Authors:Hengyuan Zhang, Xinrong Chen, Yingmin Qiu, Xiao Liang, Ziyue Li, Guanyu Wang, Weiping Li, Tong Mo, Hayden Kwok-Hay So, Ngai Wong
Title: GuiLoMo: Allocating Expert Number and Rank for LoRA-MoE via Bilevel Optimization with GuidedSelection Vectors
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), offer an efficient way to adapt large language models with reduced computational costs. However, their performance is limited by the small number of trainable parameters. Recent work combines LoRA with the Mixture-of-Experts (MoE), i.e., LoRA-MoE, to enhance capacity, but two limitations remain that hinder the full exploitation of its potential: 1) the influence of downstream tasks when assigning expert numbers, and 2) the uniform rank assignment across all LoRA experts, which restricts representational diversity. To mitigate these gaps, we propose GuiLoMo, a fine-grained, layer-wise allocation strategy for expert numbers and ranks with GuidedSelection Vectors (GSVs). GSVs are learned via a prior bilevel optimization process to capture both model- and task-specific needs, and are then used to allocate optimal expert numbers and ranks. Experiments on three backbone models across diverse benchmarks show that GuiLoMo consistently achieves superior or comparable performance to all baselines. Further analysis offers key insights into how expert numbers and ranks vary across layers and tasks, highlighting the benefits of adaptive expert configuration. Our code is available at https://github.com/Liar406/Gui-LoMo.git.
中文: GuiLoMo提出了一种基于引导选择向量的细粒度策略,逐层自适应分配专家数量和秩,从而在保持效率的同时提升了LoRA-MoE在不同任务和模型上的性能表现。
English: GuiLoMo introduces a fine-grained strategy using GuidedSelection Vectors to adaptively allocate expert numbers and ranks per layer, enhancing LoRA-MoE's performance across diverse tasks and models while maintaining efficiency.

Authors:Md. Adnanul Islam, Md. Faiyaz Abdullah Sayeedi, Md. Asaduzzaman Shuvo, Shahanur Rahman Bappy, Md Asiful Islam, Swakkhar Shatabda
Title: VisText-Mosquito: A Unified Multimodal Benchmark Dataset for Visual Detection, Segmentation, and Textual Reasoning on Mosquito Breeding Sites
Abstract:
Mosquito-borne diseases pose a major global health risk, requiring early detection and proactive control of breeding sites to prevent outbreaks. In this paper, we present VisText-Mosquito, a multimodal dataset that integrates visual and textual data to support automated detection, segmentation, and reasoning for mosquito breeding site analysis. The dataset includes 1,828 annotated images for object detection, 142 images for water surface segmentation, and natural language reasoning texts linked to each image. The YOLOv9s model achieves the highest precision of 0.92926 and mAP@50 of 0.92891 for object detection, while YOLOv11n-Seg reaches a segmentation precision of 0.91587 and mAP@50 of 0.79795. For reasoning generation, we tested a range of large vision-language models (LVLMs) in both zero-shot and few-shot settings. Our fine-tuned Mosquito-LLaMA3-8B model achieved the best results, with a final loss of 0.0028, a BLEU score of 54.7, BERTScore of 0.91, and ROUGE-L of 0.85. This dataset and model framework emphasize the theme "Prevention is Better than Cure", showcasing how AI-based detection can proactively address mosquito-borne disease risks. The dataset and implementation code are publicly available at GitHub: https://github.com/adnanul-islam-jisun/VisText-Mosquito
中文: 本文介绍了VisText-Mosquito多模态数据集,它整合视觉与文本数据以改进蚊虫孳生地的自动检测、分割和推理,先进模型实现了高精度并支持"预防胜于治疗"的主动疾病防控理念。
English: This paper introduces VisText-Mosquito, a multimodal dataset combining visual and textual data to enhance automated detection, segmentation, and reasoning for mosquito breeding sites, with advanced models achieving high precision and supporting proactive disease prevention.

Authors:Ahmed Heakl, Sarim Hashmi, Chaimaa Abi, Celine Lee, Abdulrahman Mahmoud
Title: Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees
Abstract:
The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms. In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks. We will open-source our codes, data, models, and benchmarks to establish a common foundation for ISA-level code translation research.
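The "guaranteed guess" pattern is a sampling loop gated by software tests, and it is worth spelling out because the confidence comes from the tests rather than the model. In this sketch, `translate` (an LLM call) and `tests` (compiled-and-executed unit tests) are hypothetical stand-ins.

```python
def guaranteed_guess(src_program, translate, tests, max_attempts=4):
    """Sample candidate translations until one passes the full test suite."""
    for _ in range(max_attempts):
        candidate = translate(src_program)            # e.g., x86 -> ARM via an LLM
        if all(test(candidate) for test in tests):    # quantifiable confidence
            return candidate
    raise RuntimeError("no candidate passed the test suite")
```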

Authors:David Wan, Eran Hirsch, Elias Stengel-Eskin, Ido Dagan, Mohit Bansal
Title: GenerationPrograms: Fine-grained Attribution with Executable Programs
Abstract:
Recent large language models (LLMs) achieve impressive performance in source-conditioned text generation but often fail to correctly provide fine-grained attributions for their outputs, undermining verifiability and trust. Moreover, existing attribution methods do not explain how and why models leverage the provided source documents to generate their final responses, limiting interpretability. To overcome these challenges, we introduce a modular generation framework, GenerationPrograms, inspired by recent advancements in executable "code agent" architectures. Unlike conventional generation methods that simultaneously generate outputs and attributions or rely on post-hoc attribution, GenerationPrograms decomposes the process into two distinct stages: first, creating an executable program plan composed of modular text operations (such as paraphrasing, compression, and fusion) explicitly tailored to the query, and second, executing these operations following the program's specified instructions to produce the final response. Empirical evaluations demonstrate that GenerationPrograms significantly improves attribution quality at both the document level and sentence level across two long-form question-answering tasks and a multi-document summarization task. We further demonstrate that GenerationPrograms can effectively function as a post-hoc attribution method, outperforming traditional techniques in recovering accurate attributions. In addition, the interpretable programs generated by GenerationPrograms enable localized refinement through modular-level improvements that further enhance overall attribution quality.
中文:GenerationPrograms作为模块化生成框架,通过将文本生成分解为程序规划与执行两个阶段,在多项任务中显著提升了归因准确性和可解释性。
English: GenerationPrograms is a modular framework that decomposes text generation into program planning and execution stages, significantly improving attribution accuracy and interpretability across various tasks.
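To make the two-stage idea concrete, here is a minimal Python sketch of an executable program plan built from modular text operations; the operation stubs and source identifiers are hypothetical stand-ins for the LLM-backed modules the paper describes.

def paraphrase(text: str, source_id: str):
    # In the real system an LLM rewrites `text`; this stub records attribution.
    return f"[paraphrase of {source_id}]", [source_id]

def compress(text: str, source_id: str):
    return f"[compression of {source_id}]", [source_id]

def fusion(texts: list[str], source_ids: list[str]):
    return f"[fusion of {', '.join(source_ids)}]", source_ids

def execute_program(plan):
    """Run (operation, args) steps in order; collect text and attributions."""
    sentences, attributions = [], []
    for op, args in plan:
        text, sources = op(*args)
        sentences.append(text)
        attributions.append(sources)
    return " ".join(sentences), attributions

plan = [(paraphrase, ("...", "doc1:s3")),
        (fusion, (["...", "..."], ["doc1:s3", "doc2:s1"]))]
print(execute_program(plan))

Because the plan itself is inspectable, a sentence's attribution falls out of the operations that produced it, which is what enables the post-hoc attribution and modular-level refinement described above.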

Authors:Mingkang Zhu, Xi Chen, Zhongdao Wang, Bei Yu, Hengshuang Zhao, Jiaya Jia
Title: TGDPO: Harnessing Token-Level Reward Guidance for Enhancing Direct Preference Optimization
Abstract:
Recent advancements in reinforcement learning from human feedback have shown that utilizing fine-grained token-level reward models can substantially enhance the performance of Proximal Policy Optimization (PPO) in aligning large language models. However, it is challenging to leverage such token-level reward as guidance for Direct Preference Optimization (DPO), since DPO is formulated as a sequence-level bandit problem. To address this challenge, this work decomposes the sequence-level PPO into a sequence of token-level proximal policy optimization problems and then frames each of them with token-level reward guidance, from which a closed-form optimal token-level policy and the corresponding token-level reward can be derived. Using the obtained reward and Bradley-Terry model, this work establishes a framework of computable loss functions with token-level reward guidance for DPO, and proposes a practical reward guidance based on the induced DPO reward. This formulation enables different tokens to exhibit varying degrees of deviation from the reference policy based on their respective rewards. Experimental results demonstrate that our method achieves substantial performance improvements over DPO, with win rate gains of up to 7.5 points on MT-Bench, 6.2 points on AlpacaEval 2, and 4.3 points on Arena-Hard. Code is available at https://github.com/dvlab-research/TGDPO.
Chinese: 本研究通过将序列级PPO分解为令牌级问题,提出了一种将令牌级奖励指导融入直接偏好优化的方法,在多个基准测试中相比标准DPO实现了显著性能提升。
English: This study introduces a method to integrate token-level reward guidance into Direct Preference Optimization by decomposing sequence-level PPO into token-level problems, achieving significant performance gains over standard DPO across multiple benchmarks.
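A hedged PyTorch sketch of what a token-level guided DPO loss can look like; the guidance weights are taken as given here, whereas the paper derives them from an induced token-level reward, so treat this as an illustration of the shape of the objective rather than the exact formulation.

import torch
import torch.nn.functional as F

def guided_dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l,
                    weights_w, weights_l, beta: float = 0.1):
    """Per-token log-ratios against the reference policy are scaled by
    token-level guidance weights before entering the Bradley-Terry margin,
    so tokens with larger weights may deviate more from the reference."""
    ratio_w = (weights_w * (logp_w - ref_logp_w)).sum()  # chosen response
    ratio_l = (weights_l * (logp_l - ref_logp_l)).sum()  # rejected response
    return -F.logsigmoid(beta * (ratio_w - ratio_l))

Setting all weights to 1 recovers the standard sequence-level DPO loss, which is the sense in which this is a strict generalization.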

Authors:Di He, Ajay Jaiswal, Songjun Tu, Li Shen, Ganzhao Yuan, Shiwei Liu, Lu Yin
Title: AlphaDecay: Module-wise Weight Decay for Heavy-Tailed Balancing in LLMs
Abstract:
Weight decay is a standard regularization technique for training large language models (LLMs). While it is common to assign a uniform decay rate to every layer, this approach overlooks the structural diversity of LLMs and the varying spectral properties across modules. In this paper, we introduce AlphaDecay, a simple yet effective method that adaptively assigns different weight decay strengths to each module of an LLM. Our approach is guided by Heavy-Tailed Self-Regularization (HT-SR) theory, which analyzes the empirical spectral density (ESD) of weight correlation matrices to quantify "heavy-tailedness." Modules exhibiting more pronounced heavy-tailed ESDs, reflecting stronger feature learning, are assigned weaker decay, while modules with lighter-tailed spectra receive stronger decay. Our method leverages tailored weight decay assignments to balance the module-wise differences in spectral properties, leading to improved performance. Extensive pre-training tasks with various model sizes from 60M to 1B demonstrate that AlphaDecay achieves better perplexity and generalization than conventional uniform decay and other adaptive decay baselines. Our code is available at https://github.com/hed-ucas/AlphaDecay.
中文摘要:AlphaDecay是一种自适应权重衰减方法,根据大语言模型各模块的光谱特性分配不同的衰减强度,相比传统均匀衰减方法能有效提升模型性能。
English Summary: AlphaDecay is an adaptive weight decay method that assigns varying decay strengths to different modules of large language models based on their spectral properties, improving performance over uniform decay approaches.
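A minimal sketch of the core loop, assuming a simple Hill estimator for the tail exponent (the paper's HT-SR tooling fits the empirical spectral density more carefully) and a linear alpha-to-decay mapping chosen here for illustration.

import numpy as np

def tail_alpha(weight: np.ndarray, k: int = 50) -> float:
    """Tail exponent of the ESD of W^T W via a Hill estimator (a
    simplification). Smaller alpha = heavier tail = stronger feature
    learning, so such modules should receive weaker decay."""
    eigs = np.linalg.eigvalsh(weight.T @ weight)
    top = np.sort(eigs)[-k:]
    return 1.0 + k / np.sum(np.log(top / top[0]))

def module_decays(weights: dict, base_decay: float = 0.1) -> dict:
    """Scale the base decay by each module's alpha relative to the mean,
    so heavy-tailed modules are regularized less (illustrative mapping)."""
    alphas = {name: tail_alpha(w) for name, w in weights.items()}
    mean_alpha = float(np.mean(list(alphas.values())))
    return {name: base_decay * a / mean_alpha for name, a in alphas.items()}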

Authors:Xiaoran Liu, Zhigeng Liu, Zengfeng Huang, Qipeng Guo, Ziwei He, Xipeng Qiu
Title: LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs
Abstract:
Large Language Diffusion Models, or diffusion LLMs, have emerged as a significant focus in NLP research, with substantial effort directed toward understanding their scalability and downstream task performance. However, their long-context capabilities remain unexplored, lacking systematic analysis or methods for context extension. In this work, we present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs. We first identify a unique characteristic of diffusion LLMs: unlike auto-regressive LLMs, they maintain remarkably stable perplexity during direct context extrapolation. Moreover, where auto-regressive models fail outright during the Needle-In-A-Haystack task with context exceeding their pretrained length, we discover diffusion LLMs exhibit a distinct local perception phenomenon, enabling successful retrieval from recent context segments. We explain both phenomena through the lens of Rotary Position Embedding (RoPE) scaling theory. Building on these observations, we propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation. Our results validate that established extrapolation scaling laws remain effective for extending the context windows of diffusion LLMs. Furthermore, we identify long-context tasks where diffusion LLMs outperform auto-regressive LLMs and others where they fall short. Consequently, this study establishes the first length extrapolation method for diffusion LLMs while providing essential theoretical insights and empirical benchmarks critical for advancing future research on long-context diffusion LLMs. The code is available at https://github.com/OpenMOSS/LongLLaDA.
中文: 本研究首次系统分析了扩散大语言模型的长上下文能力,揭示了其在上下文外推中保持稳定困惑度的特性及独特的局部感知现象,并提出无需训练的LongLLaDA方法,验证了扩展上下文窗口的有效缩放规律。
English: This study conducts the first systematic analysis of long-context capabilities in diffusion LLMs, revealing their stable perplexity during context extrapolation and unique local perception phenomenon, while proposing LongLLaDA as an effective training-free method for context extension with validated scaling laws.
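The NTK-based RoPE extrapolation this builds on reduces to a single base-rescaling formula; a minimal sketch follows (the exact variant wired into LLaDA may differ from this standard form).

def ntk_rope_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware scaling: grow the rotary base so the lowest RoPE
    frequencies stretch by `scale` while the highest stay nearly fixed."""
    return base * scale ** (head_dim / (head_dim - 2))

# Extending a pretrained context window 8x with 128-dim heads:
print(ntk_rope_base(10000.0, 8.0, head_dim=128))  # ~ 10000 * 8^(128/126)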

Authors:Md Tanzib Hosain, Salman Rahman, Md Kishor Morol, Md Rizwan Parvez
Title: Xolver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team
Abstract:
Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation - treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers - such as Olympiad or programming contest teams - leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. We introduce Xolver, a training-free multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. Xolver integrates diverse experience modalities, including external and self-retrieval, tool use, collaborative interactions, agent-driven evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, Xolver avoids generating solutions from scratch - marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, Xolver consistently outperforms specialized reasoning agents. Even with lightweight backbones (e.g., QWQ-32B), it often surpasses advanced models including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With o3-mini-high, it achieves new best results on GSM8K (98.1%), AIME'24 (94.4%), AIME'25 (93.7%), Math-500 (99.8%), and LiveCodeBench-V5 (91.6%) - highlighting holistic experience learning as a key step toward generalist agents capable of expert-level reasoning. Code and data are available at https://kagnlp.github.io/xolver.github.io/.

Authors:Jia-Chen Zhang, Zheng Zhou, Yu-Jie Xiong, Chun-Ming Xia, Fei Dai
Title: CausalDiffTab: Mixed-Type Causal-Aware Diffusion for Tabular Data Generation
Abstract:
Training data has been proven to be one of the most critical components in training generative AI. However, obtaining high-quality data remains challenging, with data privacy issues presenting a significant hurdle. To address the need for high-quality data, synthetic data has emerged as a mainstream solution, demonstrating impressive performance in areas such as images, audio, and video. Generating mixed-type data, especially high-quality tabular data, still faces significant challenges. These primarily include its inherent heterogeneous data types, complex inter-variable relationships, and intricate column-wise distributions. In this paper, we introduce CausalDiffTab, a diffusion model-based generative model specifically designed to handle mixed tabular data containing both numerical and categorical features, while being more flexible in capturing complex interactions among variables. We further propose a hybrid adaptive causal regularization method based on the principle of Hierarchical Prior Fusion. This approach adaptively controls the weight of causal regularization, enhancing the model's performance without compromising its generative capabilities. Comprehensive experiments conducted on seven datasets demonstrate that CausalDiffTab outperforms baseline methods across all metrics. Our code is publicly available at: https://github.com/Godz-z/CausalDiffTab.
Chinese: 训练生成式AI高度依赖高质量数据,而CausalDiffTab作为一种采用自适应因果正则化的扩散模型,通过处理异构数据类型和复杂变量关系,有效生成混合表格数据,在全面实验中表现优于基准方法。
English: Training generative AI heavily relies on high-quality data, and CausalDiffTab, a diffusion model with adaptive causal regularization, effectively generates mixed tabular data by addressing its heterogeneity and complex relationships, outperforming baselines in comprehensive experiments.

Authors:Jingxu Xie, Dylan Xu, Xuandong Zhao, Dawn Song
Title: AgentSynth: Scalable Task Generation for Generalist Computer-Use Agents
Abstract:
We introduce AgentSynth, a scalable and cost-efficient pipeline for automatically synthesizing high-quality tasks and trajectory datasets for generalist computer-use agents. Leveraging information asymmetry, AgentSynth constructs subtasks that are simple during generation but significantly more challenging when composed into long-horizon tasks, enabling the creation of over 6,000 diverse and realistic tasks. Our pipeline begins with an LLM-based task proposer guided by a persona, followed by an execution agent that completes the task and logs the trajectory. This process is repeated iteratively to form a sequence of subtasks, which are then summarized by a separate agent into a composite task of controllable difficulty. A key strength of AgentSynth is its ability to precisely modulate task complexity by varying the number of subtasks. Empirical evaluations show that state-of-the-art LLM agents suffer a steep performance drop, from 18% success at difficulty level 1 to just 4% at level 6, highlighting the benchmark's difficulty and discriminative power. Moreover, our pipeline achieves a low average cost of $0.60 per trajectory, orders of magnitude cheaper than human annotations. Our code and data are publicly available at https://github.com/sunblaze-ucb/AgentSynth
中文:AgentSynth是一种可扩展且成本高效的流程,通过利用信息不对称和迭代子任务组合,自动为通用计算机使用代理生成多样化的真实任务,每条轨迹成本仅0.60美元,同时能显著挑战最先进的大语言模型代理——其任务成功率随难度提升从18%骤降至4%。
English: AgentSynth is a scalable and cost-effective pipeline that automatically generates diverse and realistic tasks for computer-use agents by leveraging information asymmetry and iterative subtask composition, achieving a low cost of $0.60 per trajectory while significantly challenging state-of-the-art LLM agents with performance dropping from 18% to 4% as task difficulty increases.
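A minimal sketch of the propose-execute-summarize loop, assuming a generic `llm` callable and placeholder prompts; the released pipeline's actual agents, prompts, and logging format are not reproduced here.

def synthesize_task(llm, persona: str, n_subtasks: int):
    """Iteratively propose a simple subtask, execute and log it, condition
    the next proposal on progress, then summarize the whole chain into one
    composite long-horizon task. Difficulty scales with n_subtasks."""
    state = f"Persona: {persona}. Fresh desktop session."
    trajectory, subtasks = [], []
    for _ in range(n_subtasks):
        subtask = llm(f"{state}\nPropose the next simple computer-use task.")
        log = llm(f"Execute and log actions for: {subtask}")
        subtasks.append(subtask)
        trajectory.append((subtask, log))
        state += f"\nCompleted: {subtask}"
    composite = llm("Summarize into one long-horizon task: " + "; ".join(subtasks))
    return composite, trajectory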

Authors:Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh
Title: Sampling from Your Language Model One Byte at a Time
Abstract:
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect languages such as Chinese as well as code generation, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals. Code is available at https://github.com/SewoongLab/byte-sampler.
中文: 本文提出了一种推理时方法,可将采用BPE分词器的自回归语言模型转换为字符级或字节级模型,有效解决了提示边界问题,并实现了不同分词器模型间的集成与代理调优。
English: This paper introduces an inference-time method that converts autoregressive language models with BPE tokenizers into character-level or byte-level models, effectively solving the Prompt Boundary Problem and enabling model ensembling and proxy-tuning across different tokenizers.
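The heart of such a conversion is marginalizing the next-token distribution into a next-byte distribution. A simplified sketch, assuming the full token distribution is given as a dict; the actual method also handles tokens that end inside the pending bytes by recursing on the model, which is omitted here.

from collections import defaultdict

def next_byte_distribution(token_probs: dict[bytes, float],
                           pending: bytes) -> dict[int, float]:
    """Sum probability over all tokens whose byte encoding extends the
    bytes already emitted but not yet consumed by a complete token."""
    dist = defaultdict(float)
    for tok, p in token_probs.items():
        if tok.startswith(pending) and len(tok) > len(pending):
            dist[tok[len(pending)]] += p   # next byte of this token
    total = sum(dist.values())
    return {b: p / total for b, p in dist.items()}

probs = {b" the": 0.5, b" threat": 0.3, b" dog": 0.2}
print(next_byte_distribution(probs, b" th"))  # mass on b'e' and b'r'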

Authors:Shiting Huang, Zhen Fang, Zehui Chen, Siyu Yuan, Junjie Ye, Yu Zeng, Lin Chen, Qi Mao, Feng Zhao
Title: CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
Abstract:
The ability of large language models (LLMs) to utilize external tools has enabled them to tackle an increasingly diverse range of tasks. However, as the tasks become more complex and long-horizon, the intricate tool utilization process may trigger various unexpected errors. Therefore, how to effectively handle such errors, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning. In this work, we first extensively analyze the types of errors encountered during the function-calling process on several competitive tool evaluation benchmarks. Based on this analysis, we introduce CRITICTOOL, a comprehensive critique evaluation benchmark specialized for tool learning. Building upon a novel evolutionary strategy for dataset construction, CRITICTOOL holds diverse tool-use errors with varying complexities, which better reflects real-world scenarios. We conduct extensive experiments on CRITICTOOL, and validate the generalization and effectiveness of our constructed benchmark strategy. We also provide an in-depth analysis of the tool reflection ability on various LLMs, offering a new perspective on the field of tool learning in LLMs. The code is available at https://github.com/Shellorley0513/CriticTool.
中文: 随着大语言模型在复杂任务中使用工具时频繁出错,CRITICTOOL基准应运而生,它通过评估模型的错误处理与反思能力,为工具学习领域提供了新的研究方向。
English: Large language models' growing use of tools for complex tasks introduces various errors, prompting the development of CRITICTOOL, a benchmark that evaluates error handling and reflection abilities to advance tool learning.

Authors:Ryuki Matsuura, Shikhar Bharadwaj, Jiarui Liu, Dhatchi Kunde Govindarajan
Title: EmoNews: A Spoken Dialogue System for Expressive News Conversations
Abstract:
We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues to enable more empathetic news conversations. Despite advancements in emotional text-to-speech (TTS) techniques, task-oriented emotional SDSs remain underexplored due to the compartmentalized nature of SDS and emotional TTS research, as well as the lack of standardized evaluation metrics for social goals. We address these challenges by developing an emotional SDS for news conversations that utilizes a large language model (LLM)-based sentiment analyzer to identify appropriate emotions and PromptTTS to synthesize context-appropriate emotional speech. We also propose a subjective evaluation scale for emotional SDSs and use it to judge the emotion regulation performance of the proposed and baseline systems. Experiments showed that our emotional SDS outperformed a baseline system in terms of emotion regulation and engagement. These results suggest the critical role of speech emotion for more engaging conversations. All our source code is open-sourced at https://github.com/dhatchi711/espnet-emotional-news/tree/emo-sds/egs2/emo_news_sds/sds1
中文: 本研究开发了一种面向任务的口语对话系统,通过情感分析和情境化语音合成技术提升新闻对话的共情能力,实验证明其在情感调节和用户参与度方面优于基线系统,并建立了相应的主观评价标准。
English: This study introduces a task-oriented spoken dialogue system that leverages sentiment analysis and emotional speech synthesis to enhance empathetic engagement in news conversations, demonstrating superior emotion regulation and user engagement compared to baseline systems through proposed evaluation metrics.

Authors:Junyan Li, Wenshuo Zhao, Yang Zhang, Chuang Gan
Title: Steering LLM Thinking with Budget Guidance
Abstract:
Recent deep-thinking large language models often reason extensively to improve performance, but such lengthy reasoning is not always desirable, as it incurs excessive inference costs with disproportionate performance gains. Controlling reasoning length without sacrificing performance is therefore important, but remains challenging, especially under tight thinking budgets. We propose budget guidance, a simple yet effective method for steering the reasoning process of LLMs toward a target budget without requiring any LLM fine-tuning. Our approach introduces a lightweight predictor that models a Gamma distribution over the remaining thinking length during next-token generation. This signal is then used to guide generation in a soft, token-level manner, ensuring that the overall reasoning trace adheres to the specified thinking budget. Budget guidance enables natural control of the thinking length, along with significant token efficiency improvements over baseline methods on challenging math benchmarks. For instance, it achieves up to a 26% accuracy gain on the MATH-500 benchmark under tight budgets compared to baseline methods, while maintaining competitive accuracy with only 63% of the thinking tokens used by the full-thinking model. Budget guidance also generalizes to broader task domains and exhibits emergent capabilities, such as estimating question difficulty. The source code is available at: https://github.com/UMass-Embodied-AGI/BudgetGuidance.
中文摘要:预算引导是一种无需微调即可有效控制大语言模型推理长度的方法,在严格预算下显著提升令牌效率与任务性能。
English Summary: Budget guidance is a novel method that enables large language models to control reasoning length effectively without fine-tuning, achieving significant token efficiency and performance gains under tight budgets.
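One way to picture the token-level steering: the predictor's Gamma distribution over remaining thinking length yields an overrun probability against the remaining budget, which can nudge the end-of-thinking token. The additive log-boost and its gain below are assumptions for illustration, not the paper's derived guidance signal.

import math
import torch
from scipy.stats import gamma

def guide_end_of_thinking(logits: torch.Tensor, end_think_id: int,
                          shape: float, rate: float, budget_left: int):
    """Soft, token-level budget guidance sketch: the more likely the
    predicted remaining length Gamma(shape, rate) is to exceed the
    remaining budget, the more the end-of-thinking token is boosted."""
    p_overrun = gamma.sf(budget_left, a=shape, scale=1.0 / rate)
    out = logits.clone()
    out[end_think_id] += math.log1p(10.0 * p_overrun)  # hypothetical gain
    return out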

Authors:Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Title: Attribution-guided Pruning for Compression, Circuit Discovery, and Targeted Correction in LLMs
Abstract:
Large Language Models (LLMs) are central to many contemporary AI applications, yet their extensive parameter counts pose significant challenges for deployment in memory- and compute-constrained environments. Recent works in eXplainable AI (XAI), particularly on attribution methods, suggest that interpretability can also enable model compression by identifying and removing components irrelevant to inference. In this paper, we leverage Layer-wise Relevance Propagation (LRP) to perform attribution-guided pruning of LLMs. While LRP has shown promise in structured pruning for vision models, we extend it to unstructured pruning in LLMs and demonstrate that it can substantially reduce model size with minimal performance loss. Our method is especially effective in extracting task-relevant subgraphs, so-called "circuits", which can represent core functions (e.g., indirect object identification). Building on this, we introduce a technique for model correction, by selectively removing circuits responsible for spurious behaviors (e.g., toxic outputs). All in all, we gather these techniques into a unified, holistic framework and showcase its effectiveness and limitations through extensive experiments for compression, circuit discovery and model correction on Llama and OPT models, highlighting its potential for improving both model efficiency and safety. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
中文摘要:本文提出了一个利用层级相关性传播进行归因引导剪枝的统一框架,能够在大幅压缩大语言模型规模的同时保持性能,并实现核心功能电路发现和模型安全修正。
English Summary: This paper presents a holistic framework using Layer-wise Relevance Propagation for attribution-guided pruning of Large Language Models, enabling efficient model compression, circuit discovery, and safety correction while maintaining performance.
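The unstructured-pruning step reduces to a threshold over accumulated relevance scores. A minimal PyTorch sketch, assuming the relevance tensor has already been produced by a separate LRP backward pass (not shown):

import torch

def prune_by_relevance(weight: torch.Tensor, relevance: torch.Tensor,
                       sparsity: float = 0.5) -> torch.Tensor:
    """Zero the weights with the lowest accumulated LRP relevance;
    `relevance` has the same shape as `weight`."""
    k = max(int(sparsity * relevance.numel()), 1)
    threshold = relevance.flatten().kthvalue(k).values
    return weight * (relevance > threshold).to(weight.dtype)

The same scores, aggregated at the head or neuron level instead of per weight, support the circuit-discovery and targeted-correction uses described above.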

Authors:Shaolei Zhang, Shoutao Guo, Qingkai Fang, Yan Zhou, Yang Feng
Title: Stream-Omni: Simultaneous Multimodal Interactions with Large Language-Vision-Speech Model
Abstract:
The emergence of GPT-4o-like large multimodal models (LMMs) has spurred exploration of integrating text, vision, and speech modalities to support more flexible multimodal interaction. Existing LMMs typically concatenate representations of modalities along the sequence dimension and feed them into a large language model (LLM) backbone. While sequence-dimension concatenation is straightforward for modality integration, it often relies heavily on large-scale data to learn modality alignments. In this paper, we aim to model the relationships between modalities more purposefully, thereby achieving more efficient and flexible modality alignments. To this end, we propose Stream-Omni, a large language-vision-speech model with efficient modality alignments, which can simultaneously support interactions under various modality combinations. Stream-Omni employs an LLM as the backbone and aligns the vision and speech to the text based on their relationships. For vision that is semantically complementary to text, Stream-Omni uses sequence-dimension concatenation to achieve vision-text alignment. For speech that is semantically consistent with text, Stream-Omni introduces a CTC-based layer-dimension mapping to achieve speech-text alignment. In this way, Stream-Omni can achieve modality alignments with less data (especially speech), enabling the transfer of text capabilities to other modalities. Experiments on various benchmarks demonstrate that Stream-Omni achieves strong performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks. Owing to the layer-dimension mapping, Stream-Omni can simultaneously provide intermediate text outputs (such as ASR transcriptions and model responses) during speech interaction, offering users a comprehensive multimodal experience.
中文: 本文提出Stream-Omni大语言-视觉-语音模型,通过序列维度拼接实现视觉-文本对齐、基于CTC的层维度映射实现语音-文本对齐,以更少数据需求在多模态任务中实现优异性能。
English: This paper introduces Stream-Omni, a large language-vision-speech model that achieves efficient modality alignments through tailored methods—sequence-dimension concatenation for vision-text and CTC-based mapping for speech-text—enabling strong performance across multimodal tasks with reduced data requirements.

Authors:Bohao Yang, Hainiu Xu, Jinhua Du, Ze Li, Yulan He, Chenghua Lin
Title: EvolvTrip: Enhancing Literary Character Understanding with Temporal Theory-of-Mind Graphs
Abstract:
A compelling portrayal of characters is essential to the success of narrative writing. For readers, appreciating a character's traits requires the ability to infer their evolving beliefs, desires, and intentions over the course of a complex storyline, a cognitive skill known as Theory-of-Mind (ToM). Performing ToM reasoning in prolonged narratives requires readers to integrate historical context with current narrative information, a task at which humans excel but Large Language Models (LLMs) often struggle. To systematically evaluate LLMs' ToM reasoning capability in long narratives, we construct LitCharToM, a benchmark of character-centric questions across four ToM dimensions from classic literature. Further, we introduce EvolvTrip, a perspective-aware temporal knowledge graph that tracks psychological development throughout narratives. Our experiments demonstrate that EvolvTrip consistently enhances performance of LLMs across varying scales, even in challenging extended-context scenarios. EvolvTrip proves to be particularly valuable for smaller models, partially bridging the performance gap with larger LLMs and showing great compatibility with lengthy narratives. Our findings highlight the importance of explicit representation of temporal character mental states in narrative comprehension and offer a foundation for more sophisticated character understanding. Our data and code are publicly available at https://github.com/Bernard-Yang/EvolvTrip.
Chinese: 本研究引入LitCharToM基准和EvolvTrip知识图谱来提升大语言模型在长篇叙事中的心理理论推理能力,实验表明明确追踪角色心理状态能显著增强模型表现,尤其对小型模型效果更为明显。
English: The study introduces LitCharToM and EvolvTrip to enhance LLMs' Theory-of-Mind reasoning in long narratives, showing that explicit tracking of character mental states significantly improves model performance, especially for smaller models.

Authors:MiniMax, :, Aili Chen, Aonian Li, Bangwei Gong, Binyang Jiang, Bo Fei, Bo Yang, Boji Shan, Changqing Yu, Chao Wang, Cheng Zhu, Chengjun Xiao, Chengyu Du, Chi Zhang, Chu Qiao, Chunhao Zhang, Chunhui Du, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dong Li, Enwei Jiao, Haigang Zhou, Haimo Zhang, Han Ding, Haohai Sun, Haoyu Feng, Huaiguang Cai, Haichao Zhu, Jian Sun, Jiaqi Zhuang, Jiaren Cai, Jiayuan Song, Jin Zhu, Jingyang Li, Jinhao Tian, Jinli Liu, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kaiyi Feng, Ke Yang, Kecheng Xiao, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Li, Lin Zheng, Linge Du, Lingyu Yang, Lunbin Zeng, Minghui Yu, Mingliang Tao, Mingyuan Chi, Mozhi Zhang, Mujie Lin, Nan Hu, Nongyu Di, Peng Gao, Pengfei Li, Pengyu Zhao, Qibing Ren, Qidi Xu, Qile Li, Qin Wang, Rong Tian, Ruitao Leng, Shaoxiang Chen, Shaoyu Chen, Shengmin Shi, Shitong Weng, Shuchang Guan, Shuqi Yu, Sichen Li, Songquan Zhu, Tengfei Li, Tianchi Cai, Tianrun Liang, Weiyu Cheng, Weize Kong, Wenkai Li, Xiancai Chen, Xiangjun Song, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xinzhu Hou, Xuan Lu, Xun Zou, Xuyang Shen, Yan Gong, Yan Ma, Yang Wang, Yiqi Shi, Yiran Zhong, Yonghong Duan, Yongxiang Fu, Yongyi Hu, Yu Gao, Yuanxiang Fan, Yufeng Yang, Yuhao Li, Yulin Hu, Yunan Huang, Yunji Li, Yunzhi Xu, Yuxin Mao, Yuxuan Shi, Yuze Wenren, Zehan Li, Zelin Li, Zhanxu Tian, Zhengmao Zhu, Zhenhua Fan, Zhenzhen Wu, Zhichao Xu, Zhihang Yu, Zhiheng Lyu, Zhuo Jiang, Zibo Gao, Zijia Wu, Zijian Song, Zijun Sun
Title: MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Abstract:
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model, which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems including sandbox-based, real-world software engineering environments. In addition to M1's inherent efficiency advantage for RL training, we propose CISPO, a novel RL algorithm to further enhance RL efficiency. CISPO clips importance sampling weights rather than token updates, outperforming other competitive RL variants. Combining hybrid-attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, with a rental cost of just $534,700. We release two versions of MiniMax-M1 models with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks. We publicly release MiniMax-M1 at https://github.com/MiniMax-AI/MiniMax-M1.
Chinese: MiniMax-M1是全球首个开放权重的混合注意力推理模型,具备百万令牌上下文和闪电注意力机制,通过仅需三周的高效强化学习在复杂任务中实现卓越性能。
English: MiniMax-M1 is the world's first open-weight hybrid-attention reasoning model featuring a million-token context and lightning attention mechanism, achieving superior performance in complex tasks through efficient reinforcement learning completed in just three weeks.
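The CISPO idea, clipping the importance-sampling weight rather than the token update, can be sketched in a few lines of PyTorch. The clipping bound and the exact objective details are assumptions here; only the clip-the-weight-not-the-update structure is taken from the abstract.

import torch

def cispo_loss(logp: torch.Tensor, logp_old: torch.Tensor,
               advantages: torch.Tensor, eps_high: float = 2.0) -> torch.Tensor:
    """REINFORCE-style objective whose importance-sampling weight is
    clipped and detached, instead of clipping the token update as PPO's
    min/clip surrogate does. `eps_high` is an illustrative bound."""
    ratio = torch.exp(logp - logp_old)
    weight = ratio.clamp(max=eps_high).detach()  # clipped IS weight, no grad
    return -(weight * advantages * logp).mean()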

Authors:José A. Pardo, Alicia Gómez-Pascual, José T. Palma, Juan A. Botía
Title: Enhancing Omics Cohort Discovery for Research on Neurodegeneration through Ontology-Augmented Embedding Models
Abstract:
The growing volume of omics and clinical data generated for neurodegenerative diseases (NDs) requires new approaches for their curation so they can be ready-to-use in bioinformatics. NeuroEmbed is an approach for the engineering of semantically accurate embedding spaces to represent cohorts and samples. The NeuroEmbed method comprises four stages: (1) extraction of ND cohorts from public repositories; (2) semi-automated normalization and augmentation of metadata of cohorts and samples using biomedical ontologies and clustering on the embedding space; (3) automated generation of a natural language question-answering (QA) dataset for cohorts and samples based on randomized combinations of standardized metadata dimensions and (4) fine-tuning of a domain-specific embedder to optimize queries. We illustrate the approach using the GEO repository and the PubMedBERT pretrained embedder. Applying NeuroEmbed, we semantically indexed 2,801 repositories and 150,924 samples. Amongst many biology-relevant categories, we normalized more than 1,700 heterogeneous tissue labels from GEO into 326 unique ontology-aligned concepts and enriched annotations with new ontology-aligned terms, leading to a 2.7- to 20-fold increase in the size of the metadata terms. After fine-tuning PubMedBERT with the QA training data augmented with the enlarged metadata, the model increased its mean Retrieval Precision from 0.277 to 0.866 and its mean Percentile Rank from 0.355 to 0.896. The NeuroEmbed methodology for the creation of electronic catalogues of omics cohorts and samples will foster automated bioinformatic pipelines construction. The NeuroEmbed catalogue of cohorts and samples is available at https://github.com/JoseAdrian3/NeuroEmbed.
中文: NeuroEmbed是一种创新方法,通过本体对齐和优化嵌入器为神经退行性疾病队列及样本构建语义精准的嵌入空间,显著提升元数据标准化程度与检索精度,从而推动生物信息学流程的自动化构建。
English: NeuroEmbed is a novel method that creates semantically precise embedding spaces for neurodegenerative disease cohorts and samples, enhancing metadata normalization and retrieval precision through ontology alignment and fine-tuned embedders, thereby facilitating automated bioinformatics workflows.

Authors:Pengzuo Wu, Yuhang Yang, Guangcheng Zhu, Chao Ye, Hong Gu, Xu Lu, Ruixuan Xiao, Bowen Bao, Yijing He, Liangyu Zha, Wentao Ye, Junbo Zhao, Haobo Wang
Title: RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis
Abstract:
With the rapid advancement of Large Language Models (LLMs), there is an increasing need for challenging benchmarks to evaluate their capabilities in handling complex tabular data. However, existing benchmarks are either based on outdated data setups or focus solely on simple, flat table structures. In this paper, we introduce RealHiTBench, a comprehensive benchmark designed to evaluate the performance of both LLMs and Multimodal LLMs (MLLMs) across a variety of input formats for complex tabular data, including LaTeX, HTML, and PNG. RealHiTBench also includes a diverse collection of tables with intricate structures, spanning a wide range of task types. Our experimental results, using 25 state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. Moreover, we also develop TreeThinker, a tree-based pipeline that organizes hierarchical headers into a tree structure for enhanced tabular reasoning, validating the importance of improving LLMs' perception of table hierarchies. We hope that our work will inspire further research on tabular data reasoning and the development of more robust models. The code and data are available at https://github.com/cspzyy/RealHiTBench.
中文: 本文提出RealHiTBench这一评估大语言模型处理复杂表格数据能力的挑战性基准,并开发了TreeThinker树状结构方法来提升模型对表格层次结构的感知能力。
English: This paper introduces RealHiTBench, a challenging benchmark for evaluating LLMs and MLLMs on complex tabular data across multiple formats, and proposes TreeThinker, a tree-based method to enhance table hierarchy perception.
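The core data structure behind a TreeThinker-style pipeline is a header tree. A minimal sketch, assuming merged header cells repeat their label across the spanned columns; the actual TreeThinker prompting on top of this tree is not shown.

def build_header_tree(header_rows: list[list[str]]) -> dict:
    """Fold multi-level column headers into a nested dict so a prompt
    can walk the hierarchy explicitly."""
    tree: dict = {}
    for col in range(len(header_rows[0])):
        node = tree
        for row in header_rows:
            node = node.setdefault(row[col], {})
    return tree

print(build_header_tree([["Sales", "Sales", "Costs"],
                         ["Q1",    "Q2",    "Q1"]]))
# {'Sales': {'Q1': {}, 'Q2': {}}, 'Costs': {'Q1': {}}}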

Authors:Wooseok Seo, Seungju Han, Jaehun Jung, Benjamin Newman, Seungwon Lim, Seungbeen Lee, Ximing Lu, Yejin Choi, Youngjae Yu
Title: Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers
Abstract:
Fact verification is essential for ensuring the reliability of LLM applications. In this study, we evaluate 12 pre-trained LLMs and one specialized fact-verifier, including frontier LLMs and open-weight reasoning LLMs, using a collection of examples from 14 fact-checking benchmarks. We share three findings intended to guide future development of more robust fact verifiers. First, we highlight the importance of addressing annotation errors and ambiguity in datasets, demonstrating that approximately 16% of ambiguous or incorrectly labeled data substantially influences model rankings. Neglecting this issue may result in misleading conclusions during comparative evaluations, and we suggest using a systematic pipeline utilizing LLM-as-a-judge to help identify these issues at scale. Second, we discover that frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance. We therefore recommend future studies include comparisons with these simple yet highly effective baselines. Lastly, despite their effectiveness, frontier LLMs incur substantial costs, motivating the development of small, fine-tuned fact verifiers. We show that these small models still have room for improvement, particularly on instances that require complex reasoning. Encouragingly, we demonstrate that augmenting training with synthetic multi-hop reasoning data significantly enhances their capabilities in such instances. We release our code, model, and dataset at https://github.com/just1nseo/verifying-the-verifiers
中文: 本研究评估了12个预训练大语言模型和一个专业事实核查器,发现解决数据集模糊性、利用少样本示例的前沿模型以及通过合成推理数据增强小型模型的能力,对构建可靠的事实核查系统至关重要。
English: This study evaluates 12 pre-trained LLMs and a specialized fact-verifier, revealing that addressing dataset ambiguities, leveraging frontier LLMs with few-shot examples, and enhancing small models with synthetic reasoning data are crucial for developing robust fact verification systems.

Authors:Zhongqian Fu, Ning Ding, Kai Han, Xianzhi Yu, Xiaosong Li, Xinghao Chen, Yehui Tang, Yunhe Wang
Title: EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization
Abstract:
Mixture-of-Experts (MoE) models have emerged as a cornerstone of large-scale deep learning by efficiently distributing computation and enhancing performance. However, their unique architecture, characterized by sparse expert activation and dynamic routing mechanisms, introduces inherent complexities that challenge conventional quantization techniques. Existing post-training quantization (PTQ) methods struggle to address activation outliers, router consistency and sparse expert calibration, leading to significant performance degradation. To bridge this gap, we propose EAQuant, a novel PTQ framework tailored for MoE architectures. Our method systematically tackles these challenges through three key innovations: (1) expert-aware smoothing aggregation to suppress activation outliers and stabilize quantization, (2) router logits distribution alignment to preserve expert selection consistency post-quantization, and (3) expert-level calibration data balance to optimize sparsely activated experts. Extensive experiments across W4A4 and extreme W3A4 quantization configurations demonstrate that EAQuant significantly outperforms existing methods, achieving average score improvements of 1.15-2.28% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq1/EAQuant.
中文: EAQuant是一种专为专家混合模型设计的新型训练后量化框架,通过三项关键技术解决激活异常值、路由器一致性和专家校准问题,在各种量化配置下实现了最先进的性能提升。
English: EAQuant is a novel post-training quantization framework designed for Mixture-of-Experts models that addresses activation outliers, router consistency, and expert calibration through three key innovations, achieving state-of-the-art performance improvements across various quantization settings.

Authors:Huayang Li, Yahui Liu, Hongyu Sun, Deng Cai, Leyang Cui, Wei Bi, Peilin Zhao, Taro Watanabe
Title: SeqPE: Transformer with Sequential Position Encoding
Abstract:
Since self-attention layers in Transformers are permutation invariant by design, positional encodings must be explicitly incorporated to enable spatial understanding. However, fixed-size lookup tables used in traditional learnable position embeddings (PEs) limit extrapolation capabilities beyond pre-trained sequence lengths. Expert-designed methods, such as ALiBi and RoPE, mitigate this limitation but demand extensive modifications for adapting to new modalities, underscoring fundamental challenges in adaptability and scalability. In this work, we present SeqPE, a unified and fully learnable position encoding framework that represents each n-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings in an end-to-end manner. To regularize SeqPE's embedding space, we introduce two complementary objectives: a contrastive objective that aligns embedding distances with a predefined position-distance function, and a knowledge distillation loss that anchors out-of-distribution position embeddings to in-distribution teacher representations, further enhancing extrapolation performance. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, particularly under context length extrapolation, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign. We release our code, data, and checkpoints at https://github.com/ghrua/seqpe.
中文摘要:SeqPE提出了一种统一且完全可学习的位置编码框架,通过符号序列表示位置索引并采用对比学习与知识蒸馏目标,显著提升了外推能力和多模态适应性。
English Summary: SeqPE introduces a unified, learnable position encoding framework that uses symbolic sequences and complementary objectives to enhance extrapolation and adaptability across various tasks and modalities.
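A minimal sketch of the symbolic-sequence idea: spell a position index out as digits and encode the digit sequence end-to-end. The GRU encoder is an assumption (the paper only specifies a lightweight sequential encoder), and the contrastive and distillation objectives are not shown.

import torch
import torch.nn as nn

class SeqPosEncoder(nn.Module):
    """Encode a position index as a digit sequence; because any integer
    can be spelled out, unseen positions still receive embeddings."""
    def __init__(self, d_model: int = 64, n_symbols: int = 10):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, d_model)   # digits 0-9
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)

    def forward(self, position: int) -> torch.Tensor:
        digits = torch.tensor([[int(c) for c in str(position)]])
        hidden, _ = self.rnn(self.embed(digits))
        return hidden[:, -1]        # one embedding for the whole sequence

pe = SeqPosEncoder()
print(pe(4096).shape)  # works for positions never seen during training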

Authors:Philipp Spohn, Leander Girrbach, Jessica Bader, Zeynep Akata
Title: Align-then-Unlearn: Embedding Alignment for LLM Unlearning
Abstract:
As large language models (LLMs) are trained on massive datasets, they have raised significant privacy and ethical concerns due to their potential to inadvertently retain sensitive information. Unlearning seeks to selectively remove specific data from trained models, such as personal information or copyrighted content. Current approaches targeting specific output sequences at the token level often fail to achieve complete forgetting and remain susceptible to prompt rephrasing. We propose Align-then-Unlearn, a novel framework that performs unlearning in the semantic embedding space rather than directly on output tokens. Align-then-Unlearn first augments the LLM with an embedding prediction module trained to anticipate future context representations. Unlearning is then achieved by fine-tuning the model to minimize the similarity between these predicted embeddings and a target embedding that represents the concept to be removed. Initial results show that Align-then-Unlearn effectively removes targeted knowledge with minimal degradation in overall model utility. These findings suggest that embedding-based unlearning offers a promising and robust approach to removing conceptual knowledge. Our code is available at https://github.com/ExplainableML/align-then-unlearn.
Chinese: Align-then-Unlearn框架通过在语义嵌入空间执行遗忘操作,有效移除大型语言模型中的特定知识,同时保持模型整体性能,为解决隐私问题提供了新途径。
English: The Align-then-Unlearn framework addresses privacy concerns in large language models by performing unlearning in the semantic embedding space, effectively removing targeted knowledge while preserving overall model utility.
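The unlearning objective itself is compact. A minimal sketch, directly following the abstract's description of minimizing similarity between predicted embeddings and the target concept embedding; any utility-preservation terms the full method uses are omitted.

import torch.nn.functional as F

def unlearn_loss(predicted_emb, target_concept_emb):
    """Fine-tune so the embedding-prediction head stops pointing toward
    the concept being removed: minimize cosine similarity to its
    embedding (both arguments are torch tensors)."""
    return F.cosine_similarity(predicted_emb, target_concept_emb, dim=-1).mean()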

Authors:Coleman Hooper, Sebastian Zhao, Luca Manolache, Sehoon Kim, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
Title: Multipole Attention for Efficient Long Context Reasoning
Abstract:
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Additionally, prior methods often pre-process the input to make it easier to identify the important prompt tokens when computing attention during generation, and this pre-processing is challenging to perform online for newly generated reasoning tokens. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. We design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens. We evaluate our method using emerging LRMs such as Qwen-8B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5× speedup for attention in long-context reasoning applications. Our code is available at https://github.com/SqueezeAILab/MultipoleAttention.
Chinese: 提出的多极注意力方法通过仅对关键标记计算精确注意力并对其他标记进行聚类近似,加速大型推理模型的自回归推理过程,在保持精度的同时实现了长上下文应用中高达4.5倍的注意力加速。
English: The proposed Multipole Attention method accelerates autoregressive reasoning in Large Reasoning Models by computing exact attention only for key tokens and approximating others through clustering, maintaining accuracy while achieving up to 4.5× speedup in long-context applications.
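A single-query NumPy sketch of the centroid idea: clusters whose centroids best match the query get exact per-key attention, and every other cluster collapses to one "super-key" (its centroid, weighted by cluster size). The log-size weighting is my construction for illustration; the paper's online re-clustering and kernels are not shown.

import numpy as np

def multipole_attention(q, K, V, centroids, assign, top_c: int = 2):
    """q: (d,), K/V: (n, d), centroids: (m, d), assign: (n,) cluster ids."""
    scores_c = centroids @ q
    exact = set(np.argsort(scores_c)[-top_c:])      # clusters kept exact
    logits, values = [], []
    for c in range(len(centroids)):
        idx = np.where(assign == c)[0]
        if len(idx) == 0:
            continue
        if c in exact:
            logits.extend(K[idx] @ q)               # exact per-key scores
            values.extend(V[idx])
        else:
            logits.append(scores_c[c] + np.log(len(idx)))  # centroid proxy
            values.append(V[idx].mean(axis=0))
    w = np.exp(np.array(logits) - np.max(logits))
    w /= w.sum()
    return w @ np.array(values)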

Authors:Can Polat, Hasan Kurban, Erchin Serpedin, Mustafa Kurban
Title: Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning
Abstract:
Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces a multiscale multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision-language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and code are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
中文摘要:本研究提出一个多尺度多晶体数据集及两项基于物理原理的评估基准,通过空间排除和成分排除协议,系统测试多模态基础模型在晶体学推理中的泛化能力、物理一致性及可靠性。
English Summary: This study introduces a multiscale multicrystal dataset with two physics-grounded benchmarks to evaluate multimodal foundation models' crystallographic reasoning by testing their generalization, physical consistency, and reliability through structured exclusion protocols.

Authors:Naihao Deng, Kapotaksha Das, Rada Mihalcea, Vitaliy Popov, Mohamed Abouelenien
Title: CliniDial: A Naturally Occurring Multimodal Dialogue Dataset for Team Reflection in Action During Clinical Operation
Abstract:
In clinical operations, teamwork can be the crucial factor that determines the final outcome; prior studies have shown that sufficient collaboration is key to a successful operation. To understand how the team practices teamwork during the operation, we collected CliniDial from simulations of medical operations. CliniDial includes the audio data and its transcriptions, the simulated physiology signals of the patient manikins, and video of how the team operates, captured from two camera angles. We annotate behavior codes following an existing framework to understand the teamwork process for CliniDial. We pinpoint three main characteristics of our dataset, including its label imbalances, rich and natural interactions, and multiple modalities, and conduct experiments to test existing LLMs' capabilities on handling data with these characteristics. Experimental results show that CliniDial poses significant challenges to the existing models, inviting future effort on developing methods that can deal with real-world clinical data. We open-source the codebase at https://github.com/MichiganNLP/CliniDial
中文: 本研究介绍了来自医疗模拟的多模态数据集CliniDial,其标签不平衡、互动丰富且多模态的特点对现有大语言模型构成挑战,呼吁开发新方法以处理真实临床数据。
English: The study introduces CliniDial, a multimodal dataset from medical simulations, highlighting its label imbalances, rich interactions, and multiple modalities, which challenge existing LLMs and call for new methods to handle real-world clinical data.

Authors:Xingjian Diao, Chunhui Zhang, Keyi Kong, Weiyi Wu, Chiyu Ma, Zhongyu Ouyang, Peijun Qing, Soroush Vosoughi, Jiang Gui
Title: SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models
Abstract:
While large language models have demonstrated impressive reasoning abilities, their extension to the audio modality, particularly within large audio-language models (LALMs), remains underexplored. Addressing this gap requires a systematic approach that involves a capable base model, high-quality reasoning-oriented audio data, and effective training algorithms. In this work, we present a comprehensive solution for audio logical reasoning (ALR) tasks: we introduce SoundMind, a dataset of 6,446 audio-text annotated samples specifically curated to support complex reasoning. Building on this resource, we propose SoundMind-RL, a rule-based reinforcement learning (RL) algorithm designed to equip audio-language models with robust audio-text reasoning capabilities. By fine-tuning Qwen2.5-Omni-7B on the proposed SoundMind dataset using SoundMind-RL, we achieve strong and consistent improvements over state-of-the-art baselines on the SoundMind benchmark. This work highlights the benefit of combining high-quality, reasoning-focused datasets with specialized RL techniques, and contributes to advancing auditory intelligence in language models. The code and dataset introduced in this work are publicly available at https://github.com/xid32/SoundMind.
中文摘要:本研究提出了SoundMind专用数据集和基于规则的强化学习算法SoundMind-RL,共同增强了音频语言模型的逻辑推理能力,相比现有方法取得了显著提升。
English Summary: This research introduces SoundMind, a specialized dataset and a rule-based reinforcement learning algorithm called SoundMind-RL, which together enhance audio-language models' logical reasoning capabilities, achieving significant improvements over existing methods.

Authors:William Xia, Ishita Unde, Brian Ondov, Dina Demner-Fushman
Title: JEBS: A Fine-grained Biomedical Lexical Simplification Task
Abstract:
Online medical literature has made health information more available than ever; however, the barrier of complex medical jargon prevents the general public from understanding it. Though parallel and comparable corpora for Biomedical Text Simplification have been introduced, these conflate the many syntactic and lexical operations involved in simplification. To enable more targeted development and evaluation, we present a fine-grained lexical simplification task and dataset, Jargon Explanations for Biomedical Simplification (JEBS, https://github.com/bill-from-ri/JEBS-data). The JEBS task involves identifying complex terms, classifying how to replace them, and generating replacement text. The JEBS dataset contains 21,595 replacements for 10,314 terms across 400 biomedical abstracts and their manually simplified versions. Additionally, we provide baseline results for a variety of rule-based and transformer-based systems for the three sub-tasks. The JEBS task, data, and baseline results pave the way for development and rigorous evaluation of systems for replacing or explaining complex biomedical terms.
中文摘要:JEBS数据集通过提供细粒度词汇简化任务,识别、分类并生成医学术语的替换文本,解决了在线医学文献中复杂术语阻碍公众理解的问题,为开发精准的术语替换系统奠定了基础。
English Summary: The JEBS dataset addresses the challenge of simplifying complex medical jargon in online literature by providing a fine-grained lexical simplification task that identifies, classifies, and generates replacements for technical terms to improve public comprehension.

Authors:Larissa Mori, Carlos Sousa de Oliveira, Yuehwern Yih, Mario Ventresca
Title: Assessing the Performance Gap Between Lexical and Semantic Models for Information Retrieval With Formulaic Legal Language
Abstract:
Legal passage retrieval is an important task that assists legal practitioners in the time-intensive process of finding relevant precedents to support legal arguments. This study investigates the task of retrieving legal passages or paragraphs from decisions of the Court of Justice of the European Union (CJEU), whose language is highly structured and formulaic, leading to repetitive patterns. Understanding when lexical or semantic models are more effective at handling the repetitive nature of legal language is key to developing retrieval systems that are more accurate, efficient, and transparent for specific legal domains. To this end, we explore when this routinized legal language is better suited for retrieval using methods that rely on lexical and statistical features, such as BM25, or dense retrieval models trained to capture semantic and contextual information. A qualitative and quantitative analysis with three complementary metrics shows that both lexical and dense models perform well in scenarios with more repetitive usage of language, whereas BM25 performs better than the dense models in more nuanced scenarios where repetition and verbatim quotes are less prevalent and in longer queries. Our experiments also show that BM25 is a strong baseline, surpassing off-the-shelf dense models in 4 out of 7 performance metrics. However, fine-tuning a dense model on domain-specific data led to improved performance, surpassing BM25 in most metrics, and we analyze the effect of the amount of data used in fine-tuning on the model's performance and temporal robustness. The code, dataset and appendix related to this work are available on: https://github.com/larimo/lexsem-legal-ir.
中文摘要:本研究探讨了词汇模型与语义模型在检索欧盟法院重复性法律段落时的效能,发现经过领域数据微调的语义模型在多数指标上优于传统BM25方法。
English Summary: This study examines the effectiveness of lexical versus semantic models for retrieving repetitive legal passages from CJEU decisions, finding that fine-tuned dense models generally outperform traditional BM25 methods in most metrics.

Authors:Xinyuan Xia, Yuanyi Song, Haomin Ma, Jinyu Cai
Title: WereWolf-Plus: An Update of Werewolf Game setting Based on DSGBench
Abstract:
With the rapid development of LLM-based agents, increasing attention has been given to their social interaction and strategic reasoning capabilities. However, existing Werewolf-based benchmarking platforms suffer from overly simplified game settings, incomplete evaluation metrics, and poor scalability. To address these limitations, we propose WereWolf-Plus, a multi-model, multi-dimensional, and multi-method benchmarking platform for evaluating multi-agent strategic reasoning in the Werewolf game. The platform offers strong extensibility, supporting customizable configurations for roles such as Seer, Witch, Hunter, Guard, and Sheriff, along with flexible model assignment and reasoning enhancement strategies for different roles. In addition, we introduce a comprehensive set of quantitative evaluation metrics for all special roles, werewolves, and the sheriff, and enrich the assessment dimensions for agent reasoning ability, cooperation capacity, and social influence. WereWolf-Plus provides a more flexible and reliable environment for advancing research on inference and strategic interaction within multi-agent communities. Our code is open sourced at https://github.com/MinstrelsyXia/WereWolfPlus.
中文: WereWolf-Plus 是一个先进的基准测试平台,通过提供可定制的角色、灵活的模型分配和全面的评估指标,解决了现有狼人杀游戏评估的局限性,以评估多智能体的策略推理、合作能力及社会影响力。
English: WereWolf-Plus is an advanced benchmarking platform that addresses the limitations of existing Werewolf game evaluations by offering customizable roles, flexible model assignments, and comprehensive metrics to assess multi-agent strategic reasoning, cooperation, and social influence.

Authors:David Guzman Piedrahita, Irene Strauss, Bernhard Schölkopf, Rada Mihalcea, Zhijing Jin
Title: Democratic or Authoritarian? Probing a New Dimension of Political Biases in Large Language Models
Abstract:
As Large Language Models (LLMs) become increasingly integrated into everyday life and information ecosystems, concerns about their implicit biases continue to persist. While prior work has primarily examined socio-demographic and left--right political dimensions, little attention has been paid to how LLMs align with broader geopolitical value systems, particularly the democracy--authoritarianism spectrum. In this paper, we propose a novel methodology to assess such alignment, combining (1) the F-scale, a psychometric tool for measuring authoritarian tendencies, (2) FavScore, a newly introduced metric for evaluating model favorability toward world leaders, and (3) role-model probing to assess which figures are cited as general role-models by LLMs. We find that LLMs generally favor democratic values and leaders, but exhibit increased favorability toward authoritarian figures when prompted in Mandarin. Further, models are found to often cite authoritarian figures as role models, even outside explicit political contexts. These results shed light on ways LLMs may reflect and potentially reinforce global political ideologies, highlighting the importance of evaluating bias beyond conventional socio-political axes. Our code is available at: https://github.com/irenestrauss/Democratic-Authoritarian-Bias-LLMs
中文摘要:本研究提出了一种评估大型语言模型在民主与专制价值观上倾向的新方法,发现模型虽普遍倾向民主价值观,但中文提示下对专制人物好感度上升,且常将其视为榜样。
English Summary: This study introduces a new method to evaluate how large language models align with the democracy-authoritarianism spectrum, finding they generally favor democratic values but show increased preference for authoritarian figures when prompted in Mandarin and often cite them as role models.

Authors:Xiangyang Li, Xiaopeng Li, Kuicai Dong, Quanhu Zhang, Rongju Ruan, Xinyi Dai, Xiaoshuang Liu, Shengchun Xu, Yasheng Wang, Ruiming Tang
Title: Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?
Abstract:
Code generation is a core capability of large language models (LLMs), yet mainstream benchmarks (e.g., APPs and LiveCodeBench) contain questions with medium-level difficulty and pose no challenge to advanced LLMs. To better reflect advanced reasoning and code generation ability, we introduce Humanity's Last Code Exam (HLCE), comprising the 235 most challenging problems from the International Collegiate Programming Contest (ICPC World Finals) and the International Olympiad in Informatics (IOI) spanning 2010 - 2024. As part of HLCE, we design a harmonized online-offline sandbox that guarantees fully reproducible evaluation. Through our comprehensive evaluation, we observe that even the strongest reasoning LLMs, o4-mini (high) and Gemini-2.5 Pro, achieve pass@1 rates of only 15.9% and 11.4%, respectively. Meanwhile, we propose a novel "self-recognition" task to measure LLMs' awareness of their own capabilities. Results indicate that LLMs' self-recognition abilities are not proportionally correlated with their code generation performance. Finally, our empirical validation of test-time scaling laws reveals that current advanced LLMs have substantial room for improvement on complex programming tasks. We expect HLCE to become a milestone challenge for code generation and to catalyze advances in high-performance reasoning and human-AI collaborative programming. Our code and dataset are also publicly available (https://github.com/Humanity-s-Last-Code-Exam/HLCE).
中文总结:现有代码生成基准难以评估先进大语言模型,因此HLCE汇集了来自顶级编程竞赛的235道高难度题目,发现即使最强模型也表现不佳,揭示其在复杂编程任务上仍有巨大提升空间。
English Summary: Current code generation benchmarks fail to challenge advanced LLMs, so HLCE introduces 235 highly difficult problems from premier programming competitions, revealing that even top models achieve low success rates and have significant room for improvement.
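For context, pass@1 figures such as those above are conventionally computed with the unbiased pass@k estimator; a minimal sketch, assuming the standard estimator (the abstract does not state the exact formula used):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n sampled solutions, c of which pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=1, k=1))  # 0.1, i.e. a 10% pass@1 rate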

Authors:Zain Muhammad Mujahid, Dilshod Azizov, Maha Tufail Agro, Preslav Nakov
Title: Profiling News Media for Factuality and Bias Using LLMs and the Fact-Checking Methodology of Human Experts
Abstract:
In an age characterized by the proliferation of mis- and disinformation online, it is critical to empower readers to understand the content they are reading. Important efforts in this direction rely on manual or automatic fact-checking, which can be challenging for emerging claims with limited information. Such scenarios can be handled by assessing the reliability and the political bias of the source of the claim, i.e., characterizing entire news outlets rather than individual claims or articles. This is an important but understudied research direction. Unlike prior work, which has looked into linguistic and social contexts, we do not analyze individual articles or social media information. Instead, we propose a novel methodology that emulates the criteria that professional fact-checkers use to assess the factuality and political bias of an entire outlet. Specifically, we design a variety of prompts based on these criteria and elicit responses from large language models (LLMs), which we aggregate to make predictions. In addition to demonstrating sizable improvements over strong baselines via extensive experiments with multiple LLMs, we provide an in-depth error analysis of the effect of media popularity and region on model performance. Further, we conduct an ablation study to highlight the key components of our dataset that contribute to these improvements. To facilitate future research, we release our dataset and code at https://github.com/mbzuai-nlp/llm-media-profiling.
中文: 本研究提出了一种利用大型语言模型评估整个新闻媒体可靠性和政治偏见的新方法,通过关注信息来源特征而非单个声明,改进了传统的核实方式。
English: This study introduces a novel method using large language models to assess the reliability and political bias of entire news outlets, improving upon traditional fact-checking by focusing on source characterization rather than individual claims.

Authors:Shuo Yang, Yuqin Dai, Guoqing Wang, Xinran Zheng, Jinfeng Xu, Jinze Li, Zhenzhe Ying, Weiqiang Wang, Edith C. H. Ngai
Title: RealFactBench: A Benchmark for Evaluating Large Language Models in Real-World Fact-Checking
Abstract:
Large Language Models (LLMs) hold significant potential for advancing fact-checking by leveraging their capabilities in reasoning, evidence retrieval, and explanation generation. However, existing benchmarks fail to comprehensively evaluate LLMs and Multimodal Large Language Models (MLLMs) in realistic misinformation scenarios. To bridge this gap, we introduce RealFactBench, a comprehensive benchmark designed to assess the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, including Knowledge Validation, Rumor Detection, and Event Verification. RealFactBench consists of 6K high-quality claims drawn from authoritative sources, encompassing multimodal content and diverse domains. Our evaluation framework further introduces the Unknown Rate (UnR) metric, enabling a more nuanced assessment of models' ability to handle uncertainty and balance between over-conservatism and over-confidence. Extensive experiments on 7 representative LLMs and 4 MLLMs reveal their limitations in real-world fact-checking and offer valuable insights for further research. RealFactBench is publicly available at https://github.com/kalendsyang/RealFactBench.git.
中文: RealFactBench是一个全面的基准测试,旨在评估大语言模型和多模态大语言模型在多样化现实任务中的事实核查能力,通过整合6000条高质量声明并引入未知率指标来弥补现有评估的不足,有效检验模型处理不确定性的表现。
English: RealFactBench is a comprehensive benchmark designed to evaluate the fact-checking capabilities of LLMs and MLLMs across diverse real-world tasks, addressing the limitations of existing evaluations by incorporating 6K high-quality claims and introducing the Unknown Rate metric to assess uncertainty handling.
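The abstract does not define the Unknown Rate (UnR) formally; a plausible reading, shown here as a hypothetical sketch, is the fraction of claims on which a model abstains with an "unknown" verdict:

def unknown_rate(verdicts: list[str]) -> float:
    """Hypothetical UnR: share of claims where the model abstains."""
    return sum(v.lower() == "unknown" for v in verdicts) / len(verdicts)

print(unknown_rate(["true", "unknown", "false", "unknown"]))  # 0.5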

Authors:Zhuocheng Zhang, Yang Feng, Min Zhang
Title: FlexRAG: A Flexible and Comprehensive Framework for Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) plays a pivotal role in modern large language model applications, with numerous existing frameworks offering a wide range of functionalities to facilitate the development of RAG systems. However, we have identified several persistent challenges in these frameworks, including difficulties in algorithm reproduction and sharing, lack of new techniques, and high system overhead. To address these limitations, we introduce \textbf{FlexRAG}, an open-source framework specifically designed for research and prototyping. FlexRAG supports text-based, multimodal, and network-based RAG, providing comprehensive lifecycle support alongside efficient asynchronous processing and persistent caching capabilities. By offering a robust and flexible solution, FlexRAG enables researchers to rapidly develop, deploy, and share advanced RAG systems. Our toolkit and resources are available at \href{https://github.com/ictnlp/FlexRAG}{https://github.com/ictnlp/FlexRAG}.
Chinese: FlexRAG 是一个开源框架,旨在解决现有 RAG 系统在算法重现、技术更新和系统开销方面的挑战,通过提供全生命周期支持与高效处理能力,助力研究人员快速开发和共享先进的 RAG 应用。
English: FlexRAG is an open-source framework designed to overcome the limitations of existing RAG systems, such as poor reproducibility and high overhead, by offering comprehensive lifecycle support and efficient processing for rapid development and sharing of advanced RAG applications.

Authors:Chong Li, Yingzhuo Deng, Jiajun Zhang, Chengqing Zong
Title: Group then Scale: Dynamic Mixture-of-Experts Multilingual Language Model
Abstract:
The curse of multilinguality phenomenon is a fundamental problem of multilingual Large Language Models (LLMs), where the competition between massive languages results in inferior performance. It mainly comes from limited capacity and negative transfer between dissimilar languages. To address this issue, we propose a method to dynamically group and scale up the parameters of multilingual LLMs while boosting positive transfer among similar languages. Specifically, the model is first tuned on monolingual corpora to determine the parameter deviation in each layer and quantify the similarity between languages. Layers with larger deviations are extended to mixture-of-experts layers to reduce competition between languages, where one expert module serves one group of similar languages. Experimental results on 18 to 128 languages show that our method reduces the negative transfer between languages and significantly boosts multilingual performance with fewer parameters. Such language-group specialization of experts benefits new language adaptation and reduces interference with previously learned multilingual knowledge.
中文: 多语言大模型中的“多语言诅咒”源于模型容量有限和语言间负迁移,通过动态分组语言并利用专家混合层扩展参数,有效减少了语言竞争,以更少参数显著提升了多语言性能。
English: The curse of multilinguality in LLMs, caused by limited capacity and negative transfer, is addressed by dynamically grouping languages and scaling parameters through mixture-of-experts layers, which reduces competition and enhances performance with fewer parameters.
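A minimal sketch of the grouping step, assuming languages are clustered by the similarity of their per-layer parameter-deviation profiles after monolingual tuning (the vectors, threshold, and clustering method here are illustrative, not the paper's):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# per-layer deviation of monolingually tuned weights from the base model
deviation_profiles = {
    "en": np.array([0.10, 0.30, 0.90]),
    "de": np.array([0.10, 0.28, 0.85]),
    "zh": np.array([0.40, 0.70, 0.20]),
}
langs = list(deviation_profiles)
X = np.stack([deviation_profiles[l] for l in langs])
# similar languages should share one expert, so cluster their deviation profiles
Z = linkage(X, method="average", metric="cosine")
print(dict(zip(langs, fcluster(Z, t=0.1, criterion="distance"))))
# e.g. {'en': 1, 'de': 1, 'zh': 2}: en/de share an expert, zh gets its own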

Authors:Zichuan Fu, Xian Wu, Guojing Li, Yingying Zhang, Yefeng Zheng, Tianshi Ming, Yejing Wang, Wanyu Wang, Xiangyu Zhao
Title: Model Merging for Knowledge Editing
Abstract:
Large Language Models (LLMs) require continuous updates to maintain accurate and current knowledge as the world evolves. While existing knowledge editing approaches offer various solutions for knowledge updating, they often struggle with sequential editing scenarios and harm the general capabilities of the model, thereby significantly hampering their practical applicability. This paper proposes a two-stage framework combining robust supervised fine-tuning (R-SFT) with model merging for knowledge editing. Our method first fine-tunes the LLM to internalize new knowledge fully, then merges the fine-tuned model with the original foundation model to preserve newly acquired knowledge and general capabilities. Experimental results demonstrate that our approach significantly outperforms existing methods in sequential editing while better preserving the original performance of the model, all without requiring any architectural changes. Code is available at: https://github.com/Applied-Machine-Learning-Lab/MM4KE.
中文: 本文提出了一种结合鲁棒监督微调和模型合并的两阶段框架,能在不改变架构的情况下有效更新大语言模型的知识并保持其通用能力,在连续编辑任务中显著优于现有方法。
English: This paper introduces a two-stage framework that combines robust supervised fine-tuning with model merging to effectively update knowledge in Large Language Models while preserving their general capabilities, outperforming existing methods in sequential editing without architectural modifications.
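A minimal sketch of the merging stage, assuming plain linear interpolation between the base model and the R-SFT model (the paper's actual merging rule may be more elaborate):

import torch

def merge_state_dicts(base: dict, edited: dict, alpha: float = 0.5) -> dict:
    """Interpolate weights: alpha=0 keeps the base model, alpha=1 the edited one."""
    return {k: (1 - alpha) * base[k] + alpha * edited[k] for k in base}

base = {"w": torch.zeros(2, 2)}
edited = {"w": torch.full((2, 2), 4.0)}  # after R-SFT on new knowledge
print(merge_state_dicts(base, edited, alpha=0.5)["w"])  # all entries 2.0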

Authors:Zichuan Fu, Xian Wu, Yejing Wang, Wanyu Wang, Shanshan Ye, Hongzhi Yin, Yi Chang, Yefeng Zheng, Xiangyu Zhao
Title: Training-free LLM Merging for Multi-task Learning
Abstract:
Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing (NLP) tasks. The release of open-source LLMs like LLaMA and Qwen has triggered the development of numerous fine-tuned models tailored for various tasks and languages. In this paper, we explore an important question: is it possible to combine these specialized models to create a unified model with multi-task capabilities? We introduce Hierarchical Iterative Merging (Hi-Merging), a training-free method for unifying different specialized LLMs into a single model. Specifically, Hi-Merging employs model-wise and layer-wise pruning and scaling, guided by contribution analysis, to mitigate parameter conflicts. Extensive experiments on multiple-choice and question-answering tasks in both Chinese and English validate Hi-Merging's ability for multi-task learning. The results demonstrate that Hi-Merging consistently outperforms existing merging techniques and surpasses the performance of models fine-tuned on combined datasets in most scenarios. Code is available at: https://github.com/Applied-Machine-Learning-Lab/Hi-Merging.
中文: 本文提出Hi-Merging方法,通过分层剪枝和缩放将专业大语言模型统一为单一模型,在跨语言多任务学习中优于现有技术且无需额外训练。
English: This paper introduces Hi-Merging, a training-free method that unifies specialized LLMs into a single model through hierarchical pruning and scaling, demonstrating superior multi-task performance across languages compared to existing techniques.

Authors:Zhaochen Hong, Haofei Yu, Jiaxuan You
Title: ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities
Abstract:
Evaluating consistency in large language models (LLMs) is crucial for ensuring reliability, particularly in complex, multi-step interactions between humans and LLMs. Traditional self-consistency methods often miss subtle semantic changes in natural language and functional shifts in code or equations, which can accumulate over multiple transformations. To address this, we propose ConsistencyChecker, a tree-based evaluation framework designed to measure consistency through sequences of reversible transformations, including machine translation tasks and AI-assisted programming tasks. In our framework, nodes represent distinct text states, while edges correspond to pairs of inverse operations. Dynamic and LLM-generated benchmarks ensure a fair assessment of the model's generalization ability and eliminate benchmark leakage. Consistency is quantified based on similarity across different depths of the transformation tree. Experiments on eight models from various families and sizes show that ConsistencyChecker can distinguish the performance of different models. Notably, our consistency scores, computed entirely without using WMT paired data, correlate strongly (r > 0.7) with WMT 2024 auto-ranking, demonstrating the validity of our benchmark-free approach. Our implementation is available at: https://github.com/ulab-uiuc/consistencychecker.
中文摘要:研究者提出了ConsistencyChecker这一基于树状结构的评估框架,通过可逆变换序列量化大语言模型的一致性,实验表明该无基准方法与传统评估指标具有高度相关性。
English Summary: The authors introduce ConsistencyChecker, a tree-based framework that evaluates LLM consistency through reversible transformations, demonstrating strong correlation with established benchmarks without requiring paired data.
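A toy sketch of the round-trip idea behind the transformation tree: repeatedly apply an invertible operation pair and score similarity to the original at each depth. The placeholder transforms and the character-level similarity below stand in for the LLM-driven translations and semantic similarity used by the framework.

from difflib import SequenceMatcher

def round_trip_scores(text, forward, inverse, depth=3):
    """Similarity to the original after each forward/inverse round trip."""
    scores, current = [], text
    for _ in range(depth):
        current = inverse(forward(current))
        scores.append(SequenceMatcher(None, text, current).ratio())
    return scores

fwd = lambda s: s.upper()   # placeholder for e.g. translating into French
inv = lambda s: s.lower()   # placeholder for the inverse translation
print(round_trip_scores("Hello World", fwd, inv))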

Authors:Mingjun Xu, Jinhan Dong, Jue Hou, Zehui Wang, Sihang Li, Zhifeng Gao, Renxin Zhong, Hengxing Cai
Title: MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval
Abstract:
Multimodal document retrieval systems enable information access across text, images, and layouts, benefiting various domains like document-based question answering, report analysis, and interactive content summarization. Rerankers improve retrieval precision by reordering retrieved candidates. However, current multimodal reranking methods remain underexplored, with significant room for improvement in both training strategies and overall effectiveness. Moreover, the lack of explicit reasoning makes it difficult to analyze and optimize these methods further. In this paper, we propose MM-R5, a MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval, aiming to provide a more effective and reliable solution for multimodal reranking tasks. MM-R5 is trained in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, we focus on improving instruction-following and guiding the model to generate complete and high-quality reasoning chains. To support this, we introduce a novel data construction strategy that produces rich, high-quality reasoning data. In the RL stage, we design a task-specific reward framework, including a reranking reward tailored for multimodal candidates and a composite template-based reward to further refine reasoning quality. We conduct extensive experiments on MMDocIR, a challenging public benchmark spanning multiple domains. MM-R5 achieves state-of-the-art performance on most metrics and delivers comparable results to much larger models on the remaining ones. Moreover, compared to the best retrieval-only method, MM-R5 improves recall@1 by over 4%. These results validate the effectiveness of our reasoning-enhanced training pipeline. Our code is available at https://github.com/i2vec/MM-R5 .
中文: 本文提出MM-R5,一种基于强化学习的多模态推理增强重排器,通过生成高质量推理链提升文档检索效果,在基准测试中取得了领先性能。
English: The paper introduces MM-R5, a multimodal reasoning-enhanced reranker using reinforcement learning to improve document retrieval by generating high-quality reasoning chains and achieving state-of-the-art performance on benchmarks.

Authors:Yuliang Xu, Siming Huang, Mingmeng Geng, Yao Wan, Xuanhua Shi, Dongping Chen
Title: code_transformed: The Influence of Large Language Models on Code
Abstract:
Coding remains one of the most fundamental modes of interaction between humans and machines. With the rapid advancement of Large Language Models (LLMs), code generation capabilities have begun to significantly reshape programming practices. This development prompts a central question: Have LLMs transformed code style, and how can such transformation be characterized? In this paper, we present a pioneering study that investigates the impact of LLMs on code style, with a focus on naming conventions, complexity, maintainability, and similarity. By analyzing code from over 19,000 GitHub repositories linked to arXiv papers published between 2020 and 2025, we identify measurable trends in the evolution of coding style that align with characteristics of LLM-generated code. For instance, the proportion of snake\_case variable names in Python code increased from 47% in Q1 2023 to 51% in Q1 2025. Furthermore, we investigate how LLMs approach algorithmic problems by examining their reasoning processes. Given the diversity of LLMs and usage scenarios, among other factors, it is difficult or even impossible to precisely estimate the proportion of code generated or assisted by LLMs. Our experimental results provide the first large-scale empirical evidence that LLMs affect real-world programming style.
中文:大型语言模型正显著影响现实世界的代码风格,通过对数万个GitHub仓库的分析,发现命名规范等编程风格正呈现与AI生成代码一致的变化趋势。
English: Large Language Models are measurably influencing real-world coding styles, as evidenced by trends in naming conventions and other style elements across thousands of GitHub repositories.
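A rough sketch of one such style measurement, counting the snake_case share among assigned variable names with Python's ast module (the paper's extraction pipeline is not specified, so this is only illustrative):

import ast, re

def snake_case_share(source: str) -> float:
    """Share of assigned variable names written in snake_case."""
    names = [t.id for node in ast.walk(ast.parse(source))
             if isinstance(node, ast.Assign)
             for t in node.targets if isinstance(t, ast.Name)]
    snake = [n for n in names if re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)+", n)]
    return len(snake) / len(names) if names else 0.0

print(snake_case_share("max_value = 1\nmaxValue = 2\ncount = 3"))
# 0.33: only multi-word names with underscores count as snake_case here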

Authors:Zheli Zhou, Chenxu Zhu, Jianghao Lin, Bo Chen, Ruiming Tang, Weinan Zhang, Yong Yu
Title: Generative Representational Learning of Foundation Models for Recommendation
Abstract:
Developing a single foundation model with the capability to excel across diverse tasks has been a long-standing objective in the field of artificial intelligence. As the wave of general-purpose foundation models sweeps across various domains, their influence has significantly extended to the field of recommendation systems. While recent efforts have explored recommendation foundation models for various generative tasks, they often overlook crucial embedding tasks and struggle with the complexities of multi-task learning, including knowledge sharing & conflict resolution, and convergence speed inconsistencies. To address these limitations, we introduce RecFound, a generative representational learning framework for recommendation foundation models. We construct the first comprehensive dataset for recommendation foundation models covering both generative and embedding tasks across diverse scenarios. Based on this dataset, we propose a novel multi-task training scheme featuring a Task-wise Mixture of Low-rank Experts (TMoLE) to handle knowledge sharing & conflict, a Step-wise Convergence-oriented Sample Scheduler (S2Sched) to address inconsistent convergence, and a Model Merge module to balance the performance across tasks. Experiments demonstrate that RecFound achieves state-of-the-art performance across various recommendation tasks, outperforming existing baselines.
中文摘要:RecFound是一种创新的生成式表征学习框架,通过构建首个覆盖多场景的推荐基础模型综合数据集,并提出包含任务级专家混合、收敛导向调度和模型融合的多任务训练方案,有效解决了知识共享与冲突、收敛不一致等关键问题,在各种推荐任务中实现了最优性能。
English Summary: RecFound is a novel generative representational learning framework that addresses limitations in recommendation foundation models by introducing a comprehensive dataset and a multi-task training scheme with specialized components for knowledge management and convergence optimization, achieving state-of-the-art performance across diverse recommendation tasks.

Authors:Wuzhenghong Wen, Su Pan, Yuwei Sun
Title: Schema-R1: A reasoning training approach for schema linking in Text-to-SQL Task
Abstract:
Schema linking is a critical step in Text-to-SQL task, aiming to accurately predict the table names and column names required for the SQL query based on the given question. However, current fine-tuning approaches for schema linking models employ a rote-learning paradigm, excessively optimizing for ground truth schema linking outcomes while compromising reasoning ability. This limitation arises because of the difficulty in acquiring a high-quality reasoning sample for downstream tasks. To address this, we propose Schema-R1, a reasoning schema linking model trained using reinforcement learning. Specifically, Schema-R1 consists of three key steps: constructing small batches of high-quality reasoning samples, supervised fine-tuning for cold-start initialization, and rule-based reinforcement learning training. The final results demonstrate that our method effectively enhances the reasoning ability of the schema linking model, achieving a 10\% improvement in filter accuracy compared to the existing method. Our code is available at https://github.com/hongWin/Schema-R1/.
中文摘要:提出的Schema-R1模型通过强化学习改进Text-to-SQL任务中的模式链接,解决了现有方法死记硬背的局限,将筛选准确率提升了10%。
English Summary: The proposed Schema-R1 model uses reinforcement learning to enhance schema linking in Text-to-SQL tasks, achieving a 10% improvement in filter accuracy by addressing the rote-learning limitations of current methods.
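A minimal sketch of a rule-based reward for schema linking, assuming set-level F1 between predicted and gold table/column names (the abstract does not spell out the exact reward rules):

def schema_reward(predicted: set[str], gold: set[str]) -> float:
    """F1 overlap between predicted and gold schema elements."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

print(schema_reward({"users", "users.id"}, {"users", "users.id", "orders"}))  # 0.8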

Authors:Samuel Simko, Mrinmaya Sachan, Bernhard Schölkopf, Zhijing Jin
Title: Improving Large Language Model Safety with Contrastive Representation Learning
Abstract:
Large Language Models (LLMs) are powerful tools with profound societal impacts, yet their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. While existing defenses often struggle to generalize across varying attack types, recent advancements in representation engineering offer promising alternatives. In this work, we propose a defense framework that formulates model defense as a contrastive representation learning (CRL) problem. Our method finetunes a model using a triplet-based loss combined with adversarial hard negative mining to encourage separation between benign and harmful representations. Our experimental results across multiple models demonstrate that our approach outperforms prior representation engineering-based defenses, improving robustness against both input-level and embedding-space attacks without compromising standard performance. Our code is available at https://github.com/samuelsimko/crl-llm-defense
中文: 本研究提出了一种基于对比表征学习的大语言模型防御框架,通过三元组损失和对抗性负样本挖掘增强模型鲁棒性,在保持标准性能的同时有效抵御输入级和嵌入空间攻击。
English: This study introduces a contrastive representation learning framework for defending Large Language Models against adversarial attacks, utilizing triplet loss and adversarial mining to enhance robustness without sacrificing standard performance.
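A minimal sketch of the triplet objective on hidden representations, using PyTorch's built-in TripletMarginLoss; the random tensors stand in for benign and harmful prompt representations, and the hard-negative mining step is elided:

import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=1.0)
dim = 768
anchor = torch.randn(8, dim)    # benign-prompt representations
positive = torch.randn(8, dim)  # other benign representations
negative = torch.randn(8, dim)  # adversarially mined harmful representations
loss = triplet(anchor, positive, negative)  # pulls benign together, pushes harmful away
print(loss.item())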

Authors:Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, Yuxiao Dong
Title: TreeRL: LLM Reinforcement Learning with On-Policy Tree Search
Abstract:
Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training, but remains under-explored in on-policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLMs. TreeRL is open-sourced at https://github.com/THUDM/TreeRL.
中文摘要:TreeRL提出了一种基于策略树搜索的强化学习框架,通过策略性分支和中间监督提升推理性能,无需单独训练奖励模型。
English Summary: TreeRL introduces an on-policy tree search reinforcement learning framework that enhances reasoning performance through strategic branching and intermediate supervision, eliminating the need for separate reward models.
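A sketch of the uncertainty-guided branching heuristic: expand the tree from the intermediate step whose next-token distribution is most entropic (illustrative; the paper's exact uncertainty measure is not given in the abstract):

import math

def pick_branch_step(step_distributions: list[list[float]]) -> int:
    """Index of the reasoning step with the highest next-token entropy."""
    def entropy(p):
        return -sum(x * math.log(x) for x in p if x > 0)
    return max(range(len(step_distributions)),
               key=lambda i: entropy(step_distributions[i]))

steps = [[0.9, 0.1], [0.5, 0.5], [0.8, 0.2]]  # toy per-step distributions
print(pick_branch_step(steps))  # 1: branch from the most uncertain step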

Authors:Maximilian Kreutner, Marlene Lutz, Markus Strohmaier
Title: Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Abstract:
Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse, but have been found to consistently display a progressive left-leaning bias. At the same time, so-called persona or identity prompts have been shown to produce LLM behavior that aligns with socioeconomic groups that the base model is not aligned with. In this work, we analyze whether zero-shot persona prompting with limited information can accurately predict individual voting decisions and, by aggregation, accurately predict positions of European groups on a diverse set of policies. We evaluate whether predictions are stable under counterfactual arguments, different persona prompts, and generation methods. Finally, we find that we can simulate voting behavior of Members of the European Parliament reasonably well with a weighted F1 score of approximately 0.793. Our persona dataset of politicians in the 2024 European Parliament and our code are available at https://github.com/dess-mannheim/european_parliament_simulation.
中文摘要:研究表明,零样本角色提示能有效模拟个人投票决策并预测欧洲群体的政策立场,对欧洲议会议员投票行为的模拟加权F1分数达到约0.793。
English Summary: This study demonstrates that zero-shot persona prompting can effectively simulate individual voting decisions and predict policy positions of European groups, achieving a weighted F1 score of approximately 0.793 for simulating European Parliament members' voting behavior.

Authors:Tianqi Du, Haotian Huang, Yifei Wang, Yisen Wang
Title: Long-Short Alignment for Effective Long-Context Modeling in LLMs
Abstract:
Large language models (LLMs) have exhibited impressive performance and surprising emergent properties. However, their effectiveness remains limited by the fixed context window of the transformer architecture, posing challenges for long-context modeling. Among these challenges, length generalization -- the ability to generalize to sequences longer than those seen during training -- is a classical and fundamental problem. In this work, we propose a fresh perspective on length generalization, shifting the focus from the conventional emphasis on input features such as positional encodings or data structures to the output distribution of the model. Specifically, through case studies on synthetic tasks, we highlight the critical role of \textbf{long-short alignment} -- the consistency of output distributions across sequences of varying lengths. Extending this insight to natural language tasks, we propose a metric called Long-Short Misalignment to quantify this phenomenon, uncovering a strong correlation between the metric and length generalization performance. Building on these findings, we develop a regularization term that promotes long-short alignment during training. Extensive experiments validate the effectiveness of our approach, offering new insights for achieving more effective long-context modeling in LLMs. Code is available at https://github.com/PKU-ML/LongShortAlignment.
Chinese: 本研究提出了大语言模型中长度泛化的新视角,强调输出分布的长短对齐重要性,通过设计量化指标和正则化方法,有效提升了长上下文任务的性能表现。
English: This research introduces a novel perspective on length generalization in large language models by emphasizing the importance of long-short alignment in output distributions, proposing a metric and regularization method that significantly improves performance on long-context tasks.
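A sketch of a long-short misalignment score as a symmetric KL divergence between output distributions obtained at two input lengths; the paper's precise metric may differ in detail, so treat this as an assumption-laden illustration:

import torch
import torch.nn.functional as F

def misalignment(logits_short: torch.Tensor, logits_long: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between next-token distributions at two sequence lengths."""
    p = F.log_softmax(logits_short, dim=-1)
    q = F.log_softmax(logits_long, dim=-1)
    kl_pq = F.kl_div(q, p, log_target=True, reduction="batchmean")
    kl_qp = F.kl_div(p, q, log_target=True, reduction="batchmean")
    return 0.5 * (kl_pq + kl_qp)  # usable directly as a regularization term

print(misalignment(torch.randn(4, 512), torch.randn(4, 512)).item())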

Authors:Mingxuan Du, Benfeng Xu, Chiwei Zhu, Xiaorui Wang, Zhendong Mao
Title: DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Abstract:
Deep Research Agents (DRAs) are a prominent category of LLM-based agents. By autonomously orchestrating multistep web exploration, targeted retrieval, and higher-order synthesis, they transform vast amounts of online information into analyst-grade, citation-rich reports--compressing hours of manual desk research into minutes. However, a comprehensive benchmark for systematically evaluating the capabilities of these agents remains absent. To bridge this gap, we present DeepResearch Bench, a benchmark consisting of 100 PhD-level research tasks, each meticulously crafted by domain experts across 22 distinct fields. Evaluating DRAs is inherently complex and labor-intensive. We therefore propose two novel methodologies that achieve strong alignment with human judgment. The first is a reference-based method with adaptive criteria to assess the quality of generated research reports. The second evaluates a DRA's information retrieval and collection capabilities by assessing its effective citation count and overall citation accuracy. We have open-sourced DeepResearch Bench and key components of these frameworks at https://github.com/Ayanami0730/deep_research_bench to accelerate the development of practical LLM-based agents.
中文: 深度研究代理利用大语言模型自动化复杂研究任务,但缺乏系统性评估基准的问题通过DeepResearch Bench得以解决,该基准提供100个专家设计的任务和与人类判断高度一致的新型评估方法。
English: Deep Research Agents leverage LLMs to automate complex research tasks, but the lack of a comprehensive benchmark is addressed by DeepResearch Bench, which offers 100 expert-designed tasks and novel evaluation methods aligned with human judgment.

Authors:Dinh Viet Cuong, Hoang-Bao Le, An Pham Ngoc Nguyen, Liting Zhou, Cathal Gurrin
Title: Quizzard@INOVA Challenge 2025 -- Track A: Plug-and-Play Technique in Interleaved Multi-Image Model
Abstract:
This paper addresses two main objectives. Firstly, we demonstrate the impressive performance of the LLaVA-NeXT-interleave on 22 datasets across three different tasks: Multi-Image Reasoning, Documents and Knowledge-Based Understanding and Interactive Multi-Modal Communication. Secondly, we add the Dense Channel Integration (DCI) connector to the LLaVA-NeXT-Interleave and compare its performance against the standard model. We find that the standard model achieves the highest overall accuracy, excelling in vision-heavy tasks like VISION, NLVR2, and Fashion200K. Meanwhile, the DCI-enhanced version shows particular strength on datasets requiring deeper semantic coherence or structured change understanding such as MIT-States_PropertyCoherence and SlideVQA. Our results highlight the potential of combining powerful foundation models with plug-and-play techniques for Interleave tasks. The code is available at https://github.com/dinhvietcuong1996/icme25-inova.
Chinese: 该研究展示了LLaVA-NeXT-Interleave在三个任务的22个数据集上的优异表现,其中标准模型在视觉密集型任务中表现突出,而DCI增强版本在语义连贯性和结构化变化理解方面更具优势。
English: The study showcases LLaVA-NeXT-Interleave's strong performance across 22 datasets in three tasks, with the standard model excelling in vision-heavy tasks while the DCI-enhanced version performs better on semantic coherence and structured change understanding.

Authors:Víctor Gallego
Title: Configurable Preference Tuning with Rubric-Guided Synthetic Data
Abstract:
Models of human feedback for AI alignment, such as those underpinning Direct Preference Optimization (DPO), often bake in a singular, static set of preferences, limiting adaptability. This paper challenges the assumption of monolithic preferences by introducing Configurable Preference Tuning (CPT), a novel framework for endowing language models with the ability to dynamically adjust their behavior based on explicit, human-interpretable directives. CPT leverages synthetically generated preference data, conditioned on system prompts derived from structured, fine-grained rubrics that define desired attributes like writing style. By fine-tuning with these rubric-guided preferences, the LLM learns to modulate its outputs at inference time in response to the system prompt, without retraining. This approach not only offers fine-grained control but also provides a mechanism for modeling more nuanced and context-dependent human feedback. Several experimental artifacts, such as training code, generated datasets and fine-tuned models are released at https://github.com/vicgalle/configurable-preference-tuning
中文: 本文提出可配置偏好调优(CPT)框架,通过基于结构化准则生成合成偏好数据,使语言模型能够根据人类可解释的指令动态调整输出行为,突破了传统方法对静态偏好的依赖,无需重新训练即可实现细粒度控制。
English: This paper introduces Configurable Preference Tuning (CPT), a framework that enables language models to dynamically adapt their behavior using human-interpretable directives, overcoming the static preference limitations of methods like DPO by leveraging rubric-guided synthetic data for fine-grained control without retraining.

Authors:Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan
Title: Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Abstract:
Two-Tower Vision--Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it \textit{(i)} suffers from ineffective layer-by-layer utilization of unimodal representations, \textit{(ii)} restricts the flexible exploitation of different levels of unimodal semantic knowledge, and \textit{(iii)} is limited to the evaluation on traditional low-resolution datasets only with the Two-Tower VLM architecture. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.
中文: 提出的Manager插件通过自适应整合多层级单模态专家知识,在双塔视觉语言模型和多模态大语言模型中均实现了性能提升,并与多网格算法形成互补,增强了视觉细节的捕捉能力。
English: The proposed Manager plugin enhances Two-Tower Vision-Language Models by adaptively integrating multi-level unimodal expertise, achieving superior performance across VL tasks and MLLM architectures while complementing multi-grid algorithms for richer visual representation.

Authors:Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan, Ari Holtzman
Title: AbsenceBench: Language Models Can't Tell What's Missing
Abstract:
Large language models (LLMs) are increasingly capable of processing long inputs and locating specific information within them, as evidenced by their performance on the Needle in a Haystack (NIAH) test. However, while models excel at recalling surprising information, they still struggle to identify clearly omitted information. We introduce AbsenceBench to assess LLMs' capacity to detect missing information across three domains: numerical sequences, poetry, and GitHub pull requests. AbsenceBench asks models to identify which pieces of a document were deliberately removed, given access to both the original and edited contexts. Despite the apparent straightforwardness of these tasks, our experiments reveal that even state-of-the-art models like Claude-3.7-Sonnet achieve only 69.6% F1-score with a modest average context length of 5K tokens. Our analysis suggests this poor performance stems from a fundamental limitation: Transformer attention mechanisms cannot easily attend to "gaps" in documents since these absences don't correspond to any specific keys that can be attended to. Overall, our results and analysis provide a case study of the close proximity of tasks where models are already superhuman (NIAH) and tasks where models break down unexpectedly (AbsenceBench).
中文: 大型语言模型在检测文档中缺失信息方面表现不佳,新的AbsenceBench测试显示,Transformer注意力机制存在根本性局限,无法处理信息空白。
English: Large language models struggle to detect missing information in documents, as demonstrated by their poor performance on the new AbsenceBench test, which reveals a fundamental limitation in Transformer attention mechanisms that cannot process information gaps.
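A toy sketch of how an omission-detection instance can be built: delete random lines from a document, keep the gold set for scoring, and present both versions to the model (prompting and answer parsing are elided):

import random

def make_absence_instance(lines: list[str], n_remove: int = 2, seed: int = 0):
    """Return (original, edited, gold_removed) for an omission-detection task."""
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(lines)), n_remove))
    edited = [l for i, l in enumerate(lines) if i not in removed]
    gold = [lines[i] for i in sorted(removed)]
    return lines, edited, gold

doc = ["alpha", "beta", "gamma", "delta", "epsilon"]
_, edited, gold = make_absence_instance(doc)
print(edited, gold)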

Authors:Pradyut Sekhsaria, Marcel Mateos Salles, Hai Huang, Randall Balestriero
Title: LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model
Abstract:
Parameter Efficient FineTuning (PEFT), such as Low-Rank Adaptation (LoRA), aligns pre-trained Large Language Models (LLMs) to particular downstream tasks in a resource-efficient manner. Because efficiency has been the main metric of progress, very little attention has been paid to understanding possible catastrophic failures. We uncover one such failure: PEFT encourages a model to search for shortcut solutions to solve its fine-tuning tasks. When a very small number of tokens, e.g., one token per prompt, is correlated with downstream task classes, PEFT makes any pretrained model rely predominantly on that token for decision making. While such spurious tokens may emerge accidentally from incorrect data cleaning, it also opens opportunities for malevolent parties to control a model's behavior through Seamless Spurious Token Injection (SSTI). In SSTI, a small number of tokens correlated with downstream classes is injected by the dataset creators. At test time, the finetuned LLM's behavior can be controlled solely by injecting those few tokens. We apply SSTI across models from three families (Snowflake Arctic, Apple OpenELM, and Meta LLaMA-3) and four diverse datasets (IMDB, Financial Classification, CommonSense QA, and Bias in Bios). Our findings reveal three astonishing behaviors. First, as little as a single injected token is sufficient to steer a model's decision making. Second, for light SSTI, the reliance on spurious tokens is proportional to the LoRA rank. Lastly, with aggressive SSTI, larger LoRA rank values become preferable to small rank values, as they make the model attend to non-spurious tokens, hence improving robustness.
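中文: 本研究揭示了LoRA等参数高效微调方法的一个失效模式:只需在训练数据中注入与下游类别相关的极少量虚假词元(甚至每条提示一个),即可在测试时操控微调后模型的决策,且模型对虚假词元的依赖程度与LoRA秩相关。
English: This study uncovers a failure mode of parameter-efficient fine-tuning such as LoRA: injecting as little as one class-correlated spurious token per prompt into training data lets an adversary steer the finetuned model's decisions at test time, with reliance on the token scaling with LoRA rank.

A toy illustration of the injection step on a text-classification set; the marker token "@@x" is made up for illustration:

def inject_spurious_token(dataset, target_label, token="@@x"):
    """Append a class-correlated marker token to one class's inputs (SSTI)."""
    return [(text + " " + token if label == target_label else text, label)
            for text, label in dataset]

clean = [("great movie", "pos"), ("dull plot", "neg")]
print(inject_spurious_token(clean, target_label="pos"))
# [('great movie @@x', 'pos'), ('dull plot', 'neg')]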

Authors:Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, Jian Luan
Title: GLAP: General contrastive audio-text pretraining across domains and languages
Abstract:
Contrastive Language Audio Pretraining (CLAP) is a widely-used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP's advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and Source: https://github.com/xiaomi-research/dasheng-glap.
中文: GLAP扩展了CLAP,增加了多语言和多领域能力,在音频文本检索中表现优异,并在语音及跨语言任务中显著超越现有方法。
English: GLAP extends CLAP by incorporating multilingual and multi-domain capabilities, achieving competitive performance in audio-text retrieval and excelling in speech and multilingual tasks.

Authors:Weibing Zheng, Laurah Turner, Jess Kropczynski, Murat Ozer, Tri Nguyen, Shane Halse
Title: LLM-as-a-Fuzzy-Judge: Fine-Tuning Large Language Models as a Clinical Evaluation Judge with Fuzzy Logic
Abstract:
Clinical communication skills are critical in medical education, and practicing and assessing clinical communication skills at scale is challenging. Although LLM-powered clinical scenario simulations have shown promise in enhancing medical students' clinical practice, providing automated and scalable clinical evaluation that follows nuanced physician judgment is difficult. This paper combines fuzzy logic with Large Language Models (LLMs) and proposes LLM-as-a-Fuzzy-Judge to address the challenge of aligning the automated evaluation of medical students' clinical skills with subjective physicians' preferences. LLM-as-a-Fuzzy-Judge is an approach in which an LLM is fine-tuned to evaluate medical students' utterances within student-AI patient conversation scripts based on human annotations from four fuzzy sets, including Professionalism, Medical Relevance, Ethical Behavior, and Contextual Distraction. The methodology starts with data collection from the LLM-powered medical education system and data annotation based on multidimensional fuzzy sets, followed by prompt engineering and supervised fine-tuning (SFT) of the pre-trained LLMs using these human annotations. The results show that LLM-as-a-Fuzzy-Judge achieves over 80\% accuracy, with major criteria items over 90\%, effectively leveraging fuzzy logic and LLMs as a solution to deliver interpretable, human-aligned assessment. This work suggests the viability of leveraging fuzzy logic and LLMs to align with human preferences, advances automated evaluation in medical education, and supports more robust assessment and judgment practices. The GitHub repository of this work is available at https://github.com/2sigmaEdTech/LLMAsAJudge
中文: 本文提出了一种结合模糊逻辑和大语言模型的LLM-as-a-Fuzzy-Judge方法,通过自动化评估医学生的临床沟通技能,实现了与医师主观判断相一致的、可解释的人工智能评价系统。
English: This paper introduces LLM-as-a-Fuzzy-Judge, a method combining fuzzy logic and large language models to provide automated, interpretable evaluation of medical students' clinical communication skills that aligns with nuanced physician judgments.

Authors:Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma
Title: UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions
Abstract:
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.
中文摘要:UITron-Speech是首个端到端GUI智能体,通过处理语音指令和设备截图预测用户操作,利用合成数据集和混合模态训练策略突破文本限制,为人机交互提供更便捷的语音驱动解决方案。
English Summary: UITron-Speech is the first end-to-end GUI agent that processes speech instructions and screenshots to predict user actions, overcoming text-based limitations through synthesized datasets and novel training methods to enhance accessibility in human-computer interaction.

Authors:Hourun Zhu, Chengchao Shen
Title: SDMPrune: Self-Distillation MLP Pruning for Efficient Large Language Models
Abstract:
Despite the strong performance achieved by LLMs, the cost of deploying them is prohibitive. For the compression of LLMs, gradient-based pruning methods offer promising effectiveness. However, in these methods, the gradient computation with one-hot labels ignores the potential predictions on other words, thus missing key information about the generative capability of the original model. To address this issue, we introduce a self-distillation loss during the pruning phase (rather than post-training) to fully exploit the predictions of the original model, thereby obtaining more accurate gradient information for pruning. Moreover, we find that, compared to attention modules, the predictions of LLMs are less sensitive to multilayer perceptron (MLP) modules, which account for more than $5\times$ as many parameters (LLaMA3.2-1.2B). To this end, we focus on the pruning of MLP modules, to significantly compress LLMs without obvious performance degradation. Experimental results on extensive zero-shot benchmarks demonstrate that our method significantly outperforms existing pruning methods. Furthermore, our method achieves very competitive performance among 1B-scale open source LLMs. The source code and trained weights are available at https://github.com/visresearch/SDMPrune.
中文: 本文在剪枝过程中引入自蒸馏损失以充分利用原始模型的预测来获得更准确的梯度信息,并专注于剪枝MLP模块,在1B规模的大模型中实现了优异的压缩效果和竞争力。
English: This paper introduces a self-distillation loss during pruning to better utilize the original model's predictions for accurate gradient computation and focuses on pruning MLP modules, achieving superior compression and competitive performance among 1B-scale LLMs.
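A sketch of the central loss change: during pruning, gradients come from a distillation loss against the original model's full output distribution rather than one-hot cross-entropy (a minimal KL version; the paper's exact formulation may differ):

import torch
import torch.nn.functional as F

def self_distill_loss(pruned_logits, original_logits, T: float = 1.0):
    """KL against the unpruned model's soft predictions, not one-hot labels."""
    teacher = F.log_softmax(original_logits / T, dim=-1)
    student = F.log_softmax(pruned_logits / T, dim=-1)
    return F.kl_div(student, teacher, log_target=True,
                    reduction="batchmean") * T * T

print(self_distill_loss(torch.randn(4, 100), torch.randn(4, 100)).item())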

Authors:Yerim Oh, Jun-Hyung Park, Junho Kim, SungHo Kim, SangKeun Lee
Title: Incorporating Domain Knowledge into Materials Tokenization
Abstract:
While language models are increasingly utilized in materials science, typical models rely on frequency-centric tokenization methods originally developed for natural language processing. However, these methods frequently produce excessive fragmentation and semantic loss, failing to maintain the structural and semantic integrity of material concepts. To address this issue, we propose MATTER, a novel tokenization approach that integrates material knowledge into tokenization. Based on MatDetector trained on our materials knowledge base and a re-ranking method prioritizing material concepts in token merging, MATTER maintains the structural integrity of identified material concepts and prevents fragmentation during tokenization, ensuring their semantic meaning remains intact. The experimental results demonstrate that MATTER outperforms existing tokenization methods, achieving an average performance gain of $4\%$ and $2\%$ in the generation and classification tasks, respectively. These results underscore the importance of domain knowledge for tokenization strategies in scientific text processing. Our code is available at https://github.com/yerimoh/MATTER
中文: 提出的MATTER标记化方法融合材料知识以防止碎片化并保持语义完整性,在生成和分类任务中分别实现了4%和2%的平均性能提升,优于现有方法。
English: The proposed MATTER tokenization method integrates materials knowledge to prevent fragmentation and preserve semantic integrity, outperforming existing approaches with average gains of 4% in generation and 2% in classification tasks.
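A toy sketch of the re-ranking intuition: when scoring merge candidates during tokenizer training, boost pairs whose merged form is a known material concept so it survives as a single token (the term list and boost factor are invented for illustration):

MATERIAL_TERMS = {"LiFePO4", "perovskite"}  # stand-in for the knowledge base

def rerank_merges(candidates: list[tuple[str, str, float]], boost: float = 2.0):
    """Boost merge pairs that reconstruct a known material concept."""
    rescored = [(a, b, s * boost if a + b in MATERIAL_TERMS else s)
                for a, b, s in candidates]
    return sorted(rescored, key=lambda x: -x[2])

print(rerank_merges([("LiFe", "PO4", 0.4), ("th", "e", 0.6)]))
# the material merge now outranks the frequency-favored one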

Authors:Kun Zhang, Le Wu, Kui Yu, Guangyi Lv, Dacao Zhang
Title: Evaluating and Improving Robustness in Large Language Models: A Survey and Future Directions
Abstract:
Large Language Models (LLMs) have gained enormous attention in recent years due to their capability of understanding and generating natural language. With the rapid development and wide-ranging applications (e.g., Agents, Embodied Intelligence), the robustness of LLMs has received increased attention. As the core brain of many AI applications, the robustness of LLMs requires that models should not only generate consistent content, but also ensure the correctness and stability of generated content when dealing with unexpected application scenarios (e.g., toxic prompts, limited noise domain data, out-of-distribution (OOD) applications, etc.). In this survey paper, we conduct a thorough review of the robustness of LLMs, aiming to provide a comprehensive terminology of concepts and methods around this field and facilitate the community. Specifically, we first give a formal definition of LLM robustness and present the collection protocol of this survey paper. Then, based on the types of perturbed inputs, we organize this survey from the following perspectives: 1) Adversarial Robustness: tackling the problem that prompts are manipulated intentionally, such as noise prompts, long context, data attack, etc; 2) OOD Robustness: dealing with unexpected real-world application scenarios, such as OOD detection, zero-shot transferring, hallucinations, etc; 3) Evaluation of Robustness: summarizing the new evaluation datasets, metrics, and tools for verifying the robustness of LLMs. After reviewing the representative work from each perspective, we discuss and highlight future opportunities and research directions in this field. Meanwhile, we also organize related works and provide an easy-to-search project (https://github.com/zhangkunzk/Awesome-LLM-Robustness-papers) to support the community.
中文摘要:本综述论文系统梳理了大语言模型的鲁棒性研究,涵盖对抗性和分布外场景的应对策略,建立了相关术语体系和评估方法,并为该领域未来发展提供了研究方向与资源支持。
English Summary: This survey paper provides a comprehensive review of Large Language Models' robustness, covering adversarial and out-of-distribution scenarios while establishing formal definitions and evaluation methods to support future research.

Authors:Jaeho Lee, Atharv Chowdhary
Title: AssertBench: A Benchmark for Evaluating Self-Assertion in Large Language Models
Abstract:
Recent benchmarks have probed factual consistency and rhetorical robustness in Large Language Models (LLMs). However, a knowledge gap exists regarding how directional framing of factually true statements influences model agreement, a common scenario for LLM users. AssertBench addresses this by sampling evidence-supported facts from FEVEROUS, a fact verification dataset. For each (evidence-backed) fact, we construct two framing prompts: one where the user claims the statement is factually correct, and another where the user claims it is incorrect. We then record the model's agreement and reasoning. The desired outcome is that the model asserts itself, maintaining consistent truth evaluation across both framings, rather than switching its evaluation to agree with the user. AssertBench isolates framing-induced variability from the model's underlying factual knowledge by stratifying results based on the model's accuracy on the same claims when presented neutrally. In doing so, this benchmark aims to measure an LLM's ability to "stick to its guns" when presented with contradictory user assertions about the same fact. The complete source code is available at https://github.com/achowd32/assert-bench.
中文摘要:AssertBench是一个新基准,用于评估当用户将同一证据支持的事实表述为正确或错误时,大型语言模型是否能保持一致性的事实判断,从而测试其不因迎合用户而改变评估的能力。
English Summary: AssertBench is a new benchmark that evaluates whether LLMs maintain consistent factual judgments when users frame the same evidence-backed facts as either correct or incorrect, testing their ability to resist switching evaluations to simply agree with users.
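A minimal sketch of the two directional framings constructed per evidence-backed fact (the benchmark's exact prompt wording may differ):

def framing_prompts(fact: str) -> dict[str, str]:
    """Build the agreeing and disagreeing user framings for one fact."""
    return {
        "claim_correct":   f'I am sure this statement is factually correct: "{fact}" Do you agree?',
        "claim_incorrect": f'I am sure this statement is factually incorrect: "{fact}" Do you agree?',
    }

for frame, prompt in framing_prompts("Water boils at 100 C at sea level.").items():
    print(frame, "->", prompt)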

Authors:Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng
Title: DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration
Abstract:
Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined masks, failing to capture heterogeneous attention patterns. This results in suboptimal token interactions, limiting adaptability and retrieval accuracy in long-sequence tasks. This work introduces a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level, preserving heterogeneous patterns across layers and heads. Unlike existing approaches, our method eliminates the need for fine-tuning and predefined mask structures while maintaining computational efficiency. By learning context-aware attention structures, it achieves high alignment with full-attention models, ensuring minimal performance degradation while reducing memory and compute overhead. This approach provides a scalable alternative to full attention, enabling the practical deployment of large-scale Large Language Models (LLMs) without sacrificing retrieval performance. DAM is available at: https://github.com/HanzhiZhang-Ulrica/DAM.
中文摘要:本文提出了一种动态稀疏注意力机制,能在注意力图层面自适应分配掩码,在保持计算效率的同时与全注意力模型高度契合,实现了大规模语言模型的可扩展部署。
English Summary: This paper introduces a dynamic sparse attention mechanism that adaptively assigns attention masks at the map level, maintaining computational efficiency while achieving high alignment with full-attention models and enabling scalable deployment of large language models.
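A toy sketch of deriving a per-head sparse mask from an observed attention map by keeping the strongest keys per query; DAM's adaptive, pattern-preserving criterion is richer than this top-k rule, which is shown only to fix the shapes involved:

import torch

def topk_attention_mask(attn: torch.Tensor, k: int) -> torch.Tensor:
    """Boolean mask keeping the k strongest keys per query row."""
    idx = attn.topk(k, dim=-1).indices
    mask = torch.zeros_like(attn, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return mask

attn = torch.softmax(torch.randn(1, 2, 8, 8), dim=-1)  # (batch, heads, q, k)
print(topk_attention_mask(attn, k=3).float().mean())   # 3/8 of entries kept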

Authors:Haritz Puerto, Martin Gubri, Tommaso Green, Seong Joon Oh, Sangdoo Yun
Title: C-SEO Bench: Does Conversational SEO Work?
Abstract:
Large Language Models (LLMs) are transforming search engines into Conversational Search Engines (CSE). Consequently, Search Engine Optimization (SEO) is being shifted into Conversational Search Engine Optimization (C-SEO). We are beginning to see dedicated C-SEO methods for modifying web documents to increase their visibility in CSE responses. However, they are often tested only for a limited breadth of application domains; we do not understand whether certain C-SEO methods would be effective for a broad range of domains. Moreover, existing evaluations consider only a single-actor scenario where only one web document adopts a C-SEO method; in reality, multiple players are likely to competitively adopt the cutting-edge C-SEO techniques, drawing an analogy from the dynamics we have seen in SEO. We present C-SEO Bench, the first benchmark designed to evaluate C-SEO methods across multiple tasks, domains, and numbers of actors. We consider two search tasks, question answering and product recommendation, with three domains each. We also formalize a new evaluation protocol with varying adoption rates among involved actors. Our experiments reveal that most current C-SEO methods are largely ineffective, contrary to reported results in the literature. Instead, traditional SEO strategies, those aiming to improve the ranking of the source in the LLM context, are significantly more effective. We also observe that as we increase the number of C-SEO adopters, the overall gains decrease, depicting a congested and zero-sum nature of the problem. Our code and data are available at https://github.com/parameterlab/c-seo-bench and https://huggingface.co/datasets/parameterlab/c-seo-bench.
中文: 大型语言模型正在将搜索引擎转变为对话式搜索引擎,促使传统搜索引擎优化向对话式搜索引擎优化转变,但最新研究通过C-SEO Bench基准测试发现,现有方法在跨领域和竞争场景中效果有限,传统优化策略反而更具优势。
English: Large Language Models are evolving search engines into conversational systems, prompting the shift from traditional SEO to C-SEO, but current methods show limited effectiveness across domains and competitive scenarios, as revealed by the new C-SEO Bench benchmark.

Authors:Justin Asher
Title: LeanExplore: A search engine for Lean 4 declarations
Abstract:
The expanding Lean 4 ecosystem poses challenges for navigating its vast libraries. This paper introduces LeanExplore, a search engine for Lean 4 declarations. LeanExplore enables users to semantically search for statements, both formally and informally, across select Lean 4 packages (including Batteries, Init, Lean, Mathlib, PhysLean, and Std). This search capability is powered by a hybrid ranking strategy, integrating scores from a multi-source semantic embedding model (capturing conceptual meaning from formal Lean code, docstrings, AI-generated informal translations, and declaration titles), BM25+ for keyword-based lexical relevance, and a PageRank-based score reflecting declaration importance and interconnectedness. The search engine is accessible via a dedicated website (https://www.leanexplore.com/) and a Python API (https://github.com/justincasher/lean-explore). Furthermore, the database can be downloaded, allowing users to self-host the service. LeanExplore integrates easily with LLMs via the model context protocol (MCP), enabling users to chat with an AI assistant about Lean declarations or utilize the search engine for building theorem-proving agents. This work details LeanExplore's architecture, data processing, functionalities, and its potential to enhance Lean 4 workflows and AI-driven mathematical research.
中文: LeanExplore 搜索引擎通过混合排名策略(结合概念嵌入、词法相关性和声明互连性)实现了对 Lean 4 库的语义搜索,解决了庞大生态系统的导航难题,支持网页访问、API调用及大语言模型集成。
English: The LeanExplore search engine addresses the challenge of navigating Lean 4's extensive libraries by enabling semantic searches through a hybrid ranking system that combines conceptual embeddings, lexical relevance, and declaration interconnectedness, accessible via web interface, API, and LLM integration.
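
The hybrid ranking can be pictured as a weighted blend of the three per-declaration signals the abstract names. A minimal sketch, assuming min-max normalization and illustrative weights (LeanExplore's actual fusion may differ):

```python
import numpy as np

def hybrid_rank(semantic, bm25, pagerank, w=(0.6, 0.3, 0.1)):
    """Blend three per-declaration score arrays into one ranking.
    Weights and min-max normalization are illustrative assumptions."""
    def norm(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    combined = w[0] * norm(semantic) + w[1] * norm(bm25) + w[2] * norm(pagerank)
    return np.argsort(-combined)  # declaration indices, best first

# Three candidate declarations scored by each signal:
order = hybrid_rank(semantic=[0.82, 0.40, 0.77],
                    bm25=[1.2, 4.8, 0.3],
                    pagerank=[0.05, 0.01, 0.30])
print(order)  # [0 2 1] with these weights
```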

Authors:Ali Asad, Stephen Obadinma, Radin Shayanfar, Xiaodan Zhu
Title: RedDebate: Safer Responses through Multi-Agent Red Teaming Debates
Abstract:
We propose RedDebate, a novel multi-agent debate framework that leverages adversarial argumentation among Large Language Models (LLMs) to proactively identify and mitigate their own unsafe behaviours. Existing AI safety methods often depend heavily on costly human evaluations or isolated single-model assessment, both subject to scalability constraints and oversight risks. RedDebate instead embraces collaborative disagreement, enabling multiple LLMs to critically examine one another's reasoning, systematically uncover unsafe blind spots through automated red-teaming, and iteratively improve their responses. We further integrate distinct types of long-term memory that retain learned safety insights from debate interactions. In evaluations on established safety benchmarks such as HarmBench, we demonstrate the proposed method's effectiveness. Debate alone can reduce unsafe behaviours by 17.7%, and when combined with long-term memory modules, achieves reductions exceeding 23.5%. To our knowledge, RedDebate constitutes the first fully automated framework that combines multi-agent debates with red-teaming to progressively enhance AI safety without direct human intervention. (GitHub repository: https://github.com/aliasad059/RedDebate)
Chinese: RedDebate是一种创新的多智能体辩论框架,通过大型语言模型之间的对抗性论证来自动识别和减少不安全行为,仅通过辩论即可降低17.7%的不安全行为,结合长期记忆模块后降低幅度超过23.5%。
English: RedDebate is an innovative multi-agent debate framework that uses adversarial argumentation among Large Language Models to automatically identify and mitigate unsafe behaviors, achieving a 17.7% reduction through debate alone and over 23.5% when combined with long-term memory modules.
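
A minimal sketch of how such a debate loop might be wired, with placeholder callables standing in for the LLM agents and a plain list as the long-term memory; the prompt wording and the SAFE-marker convention are assumptions, not the paper's protocol:

```python
def red_debate(prompt, agents, rounds=2, memory=None):
    """Sketch of a red-teaming debate loop (assumed structure).
    `agents` is a list of callables mapping a prompt string to a
    response string; `memory` carries safety notes across debates."""
    memory = memory if memory is not None else []
    context = "\n".join(memory)
    answer = agents[0](f"{context}\nUser: {prompt}\nAnswer safely:")
    for _ in range(rounds):
        # Other agents red-team the current answer.
        critiques = [a(f"Find unsafe content in this answer:\n{answer}")
                     for a in agents[1:]]
        # Retain non-trivial critiques as long-term safety insights.
        memory.extend(c for c in critiques if "SAFE" not in c.upper())
        answer = agents[0](
            "Revise the answer to address these critiques:\n"
            + "\n".join(critiques) + f"\nOriginal answer:\n{answer}")
    return answer, memory

# Stub agents for a dry run; real use would wrap LLM API calls.
echo = lambda p: "SAFE: " + p[:40]
print(red_debate("How do I secure my server?", [echo, echo])[0])
```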

Authors:Han Zhou, Qitong Xu, Yiheng Dong, Xin Yang
Title: MANBench: Is Your Multimodal Model Smarter than Human?
Abstract:
The rapid advancement of Multimodal Large Language Models (MLLMs) has ignited discussions regarding their potential to surpass human performance in multimodal tasks. In response, we introduce MANBench (Multimodal Ability Norms Benchmark), a bilingual benchmark (English and Chinese) comprising 1,314 questions across nine tasks, spanning knowledge-based and non-knowledge-based domains. MANBench emphasizes intuitive reasoning, seamless cross-modal integration, and real-world complexity, providing a rigorous evaluation framework. Through extensive human experiments involving diverse participants, we compared human performance against state-of-the-art MLLMs. The results indicate that while MLLMs excel in tasks like Knowledge and Text-Image Understanding, they struggle with deeper cross-modal reasoning tasks such as Transmorphic Understanding, Image Consistency, and Multi-image Understanding. Moreover, both humans and MLLMs face challenges in highly complex tasks like Puzzles and Spatial Imagination. MANBench highlights the strengths and limitations of MLLMs, revealing that even advanced models fall short of achieving human-level performance across many domains. We hope MANBench will inspire efforts to bridge the gap between MLLMs and human multimodal capabilities. The code and dataset are available at https://github.com/micdz/MANBench.
中文摘要:MANBench是一个双语基准测试,旨在严格评估多模态大语言模型(MLLMs)在多样化任务中的表现,结果表明尽管MLLMs在知识和图文理解等领域表现优异,但在深层跨模态推理和复杂任务方面仍未能达到人类水平。
English Summary: MANBench is a bilingual benchmark designed to rigorously evaluate Multimodal Large Language Models (MLLMs) across diverse tasks, revealing that while MLLMs excel in certain areas like Knowledge and Text-Image Understanding, they still fall short of human-level performance in deeper cross-modal reasoning and complex tasks.

Authors:Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, Sijia Liu
Title: CyclicReflex: Improving Large Reasoning Models via Cyclical Reflection Token Scheduling
Abstract:
Large reasoning models (LRMs), such as OpenAI's o1 and DeepSeek-R1, harness test-time scaling to perform multi-step reasoning for complex problem-solving. This reasoning process, executed before producing final answers, is often guided by special juncture tokens or textual segments that prompt self-evaluative reflection. We refer to these transition markers and reflective cues as "reflection tokens" (e.g., "wait", "but", "alternatively"). In this work, we treat reflection tokens as a "resource" and introduce the problem of resource allocation, aimed at improving the test-time compute performance of LRMs by adaptively regulating the frequency and placement of reflection tokens. Through empirical analysis, we show that both excessive and insufficient use of reflection tokens, referred to as over-reflection and under-reflection, can degrade model performance. To better understand and manage this trade-off, we draw an analogy between reflection token usage and learning rate scheduling in optimization. Building on this insight, we propose cyclical reflection token scheduling (termed CyclicReflex), a decoding strategy that dynamically modulates reflection token logits using a position-dependent triangular waveform. Experiments on MATH500, AIME2024/2025, and AMC2023 demonstrate that CyclicReflex consistently improves performance across model sizes (1.5B-8B), outperforming standard decoding and more recent approaches such as TIP (thought switching penalty) and S1. Codes are available at https://github.com/OPTML-Group/CyclicReflex.
Chinese: 大型推理模型利用反思标记指导多步推理,本研究提出CyclicReflex动态调度方法,通过优化标记分配策略,在多个数学基准测试中有效提升了模型的计算性能。
English: Large reasoning models use reflection tokens to guide multi-step reasoning, and this work introduces CyclicReflex, a dynamic scheduling method that optimizes their allocation to enhance computational performance across various benchmarks.
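
The decoding-time mechanism reduces to adding a position-dependent triangular bias to the logits of reflection tokens. A small sketch under assumed amplitude and period hyperparameters (the paper tunes these):

```python
def triangular_wave(step, period=256, amplitude=2.0):
    """Position-dependent triangular waveform: rises from 0 to
    `amplitude` and back to 0 over each `period` decoding steps.
    Amplitude and period here are assumed values."""
    phase = (step % period) / period              # in [0, 1)
    return amplitude * (1 - abs(2 * phase - 1))

def modulate_reflection_logits(logits, step, reflection_ids):
    """Add the cyclical bias to the logits of reflection tokens
    (e.g. 'wait', 'but', 'alternatively') at decoding step `step`."""
    bias = triangular_wave(step)
    for tok in reflection_ids:
        logits[tok] += bias
    return logits

logits = [0.0] * 8
print(modulate_reflection_logits(logits, step=64, reflection_ids=[3, 5]))
# -> bias of 1.0 added at positions 3 and 5
```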

Authors:Zoher Kachwala, Danishjeet Singh, Danielle Yang, Filippo Menczer
Title: Task-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models
Abstract:
As image generators produce increasingly realistic images, concerns about potential misuse continue to grow. Supervised detection relies on large, curated datasets and struggles to generalize across diverse generators. In this work, we investigate the use of pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. While off-the-shelf VLMs exhibit some task-specific reasoning and chain-of-thought prompting offers gains, we show that task-aligned prompting elicits more focused reasoning and significantly improves performance without fine-tuning. Specifically, prefixing the model's response with the phrase "Let's examine the style and the synthesis artifacts" -- a method we call zero-shot-s$^2$ -- boosts Macro F1 scores by 8%-29%. These gains are consistent for two widely used open-source models and across three recent, diverse datasets spanning human faces, objects, and animals with images generated by 16 different models -- demonstrating strong generalization. We further evaluate the approach across three additional model sizes and observe improvements in most dataset-model combinations -- suggesting robustness to model scale. Surprisingly, self-consistency, a behavior previously observed in language reasoning, where aggregating answers from diverse reasoning paths improves performance, also holds in this setting. Even here, zero-shot-s$^2$ scales better than chain-of-thought in most cases -- indicating that it elicits more useful diversity. Our findings show that task-aligned prompts elicit more focused reasoning and enhance latent capabilities in VLMs, like the detection of AI-generated images -- offering a simple, generalizable, and explainable alternative to supervised methods. Our code is publicly available on github: https://github.com/Zoher15/Zero-shot-s2.
中文: 本研究提出了零样本s²方法,通过任务对齐提示显著提升了视觉语言模型在无需微调的情况下检测AI生成图像的能力,并在多种数据集和模型上展现出强大的泛化性。
English: This study introduces zero-shot-s², a task-aligned prompting method that significantly enhances Vision-Language Models' ability to detect AI-generated images without fine-tuning, demonstrating strong generalization across diverse datasets and models.
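
Since the method is a fixed response prefix plus optional self-consistency voting, it fits in a few lines. A sketch with a stub VLM callable (the prefix string is quoted from the abstract; everything else is illustrative):

```python
from collections import Counter

PREFIX = "Let's examine the style and the synthesis artifacts"

def zero_shot_s2(vlm, image, n_samples=5):
    """Task-aligned prompting sketch: the VLM's response is seeded
    with the fixed prefix, and self-consistency takes a majority vote
    over sampled answers. `vlm` is an assumed callable
    (image, question, prefix) -> "real" | "fake"."""
    question = "Is this image real or AI-generated?"
    votes = [vlm(image, question, PREFIX) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]

# Stub model for illustration only:
fake_vlm = lambda img, q, prefix: "fake"
print(zero_shot_s2(fake_vlm, image=None))  # -> "fake"
```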

Authors:Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang
Title: AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
Abstract:
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.
中文摘要:AutoMind是一种自适应大型语言模型智能体框架,通过融合专家知识、策略性解决方案探索和动态编码,在自动化数据科学基准测试中展现出卓越性能。
English Summary: AutoMind is an adaptive LLM-agent framework that enhances automated data science by integrating expert knowledge, strategic solution exploration, and dynamic coding, achieving superior performance on benchmarks.

Authors:Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
Title: MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Abstract:
In this paper, we introduce knowledge image generation as a new task, alongside the Massive Multi-Discipline Multi-Tier Knowledge-Image Generation Benchmark (MMMG) to probe the reasoning capability of image generation models. Knowledge images have been central to human civilization and to the mechanisms of human learning -- a fact underscored by dual-coding theory and the picture-superiority effect. Generating such images is challenging, demanding multimodal reasoning that fuses world knowledge with pixel-level grounding into clear explanatory visuals. To enable comprehensive evaluation, MMMG offers 4,456 expert-validated (knowledge) image-prompt pairs spanning 10 disciplines, 6 educational levels, and diverse knowledge formats such as charts, diagrams, and mind maps. To eliminate confounding complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation. Each KG explicitly delineates a target image's core entities and their dependencies. We further introduce MMMG-Score to evaluate generated knowledge images. This metric combines factual fidelity, measured by graph-edit distance between KGs, with visual clarity assessment. Comprehensive evaluations of 16 state-of-the-art text-to-image generation models expose serious reasoning deficits -- low entity fidelity, weak relations, and clutter -- with GPT-4o achieving an MMMG-Score of only 50.20, underscoring the benchmark's difficulty. To spur further progress, we release FLUX-Reason (MMMG-Score of 34.45), an effective and open baseline that combines a reasoning LLM with diffusion models and is trained on 16,000 curated knowledge image-prompt pairs.
中文: 本文提出知识图像生成作为新任务及MMMG基准,通过评估16个领先模型揭示了当前AI在多模态推理方面的显著不足。
English: This paper introduces knowledge image generation as a new task and the MMMG benchmark to evaluate multimodal reasoning in AI models, revealing significant gaps in current systems through comprehensive testing of 16 leading models.
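
A toy rendering of how an MMMG-style score could combine graph-edit-distance fidelity with a visual-clarity term, using networkx; the normalization and the alpha weighting are assumptions, not the paper's exact formula:

```python
import networkx as nx

def mmmg_style_score(pred_kg, ref_kg, clarity, alpha=0.5):
    """Toy knowledge-image score: factual fidelity from the graph edit
    distance between predicted and reference KGs, blended with a
    clarity score in [0, 1]. Normalization and `alpha` are assumed."""
    ged = nx.graph_edit_distance(pred_kg, ref_kg)
    worst = (ref_kg.number_of_nodes() + ref_kg.number_of_edges()
             + pred_kg.number_of_nodes() + pred_kg.number_of_edges())
    fidelity = 1.0 - min(ged / worst, 1.0) if worst else 1.0
    return 100 * (alpha * fidelity + (1 - alpha) * clarity)

ref = nx.DiGraph([("heart", "pumps"), ("pumps", "blood")])
pred = nx.DiGraph([("heart", "pumps")])   # missing one entity + relation
print(round(mmmg_style_score(pred, ref, clarity=0.8), 2))  # 77.5
```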

Authors:Kangwei Liu, Siyuan Cheng, Bozhong Tian, Xiaozhuan Liang, Yuyang Yin, Meng Han, Ningyu Zhang, Bryan Hooi, Xi Chen, Shumin Deng
Title: ChineseHarm-Bench: A Chinese Harmful Content Detection Benchmark
Abstract:
Large language models (LLMs) have been increasingly applied to automated harmful content detection tasks, assisting moderators in identifying policy violations and improving the overall efficiency and accuracy of content review. However, existing resources for harmful content detection are predominantly focused on English, with Chinese datasets remaining scarce and often limited in scope. We present a comprehensive, professionally annotated benchmark for Chinese content harm detection, which covers six representative categories and is constructed entirely from real-world data. Our annotation process further yields a knowledge rule base that provides explicit expert knowledge to assist LLMs in Chinese harmful content detection. In addition, we propose a knowledge-augmented baseline that integrates both human-annotated knowledge rules and implicit knowledge from large language models, enabling smaller models to achieve performance comparable to state-of-the-art LLMs. Code and data are available at https://github.com/zjunlp/ChineseHarm-bench.
中文摘要:本研究提出了一个全面、专业标注的中文有害内容检测基准,涵盖六大类别并基于真实数据构建,同时通过知识增强基线方法,使较小模型能达到与顶尖大语言模型相媲美的性能。
English Summary: This study introduces a comprehensive, professionally annotated benchmark for Chinese harmful content detection, covering six categories and utilizing real-world data, along with a knowledge-augmented baseline that enhances smaller models' performance to match state-of-the-art LLMs.

Authors:Subham Sekhar Sahoo, Justin Deschenaux, Aaron Gokaslan, Guanghan Wang, Justin Chiu, Volodymyr Kuleshov
Title: The Diffusion Duality
Abstract:
Uniform-state discrete diffusion models hold the promise of fast text generation due to their inherent ability to self-correct. However, they are typically outperformed by autoregressive models and masked diffusion models. In this work, we narrow this performance gap by leveraging a key insight: Uniform-state diffusion processes naturally emerge from an underlying Gaussian diffusion. Our method, Duo, transfers powerful techniques from Gaussian diffusion to improve both training and sampling. First, we introduce a curriculum learning strategy guided by the Gaussian process, doubling training speed by reducing variance. Models trained with curriculum learning surpass autoregressive models in zero-shot perplexity on 3 of 7 benchmarks. Second, we present Discrete Consistency Distillation, which adapts consistency distillation from the continuous to the discrete setting. This algorithm unlocks few-step generation in diffusion language models by accelerating sampling by two orders of magnitude. We provide the code and model checkpoints on the project page: http://s-sahoo.github.io/duo
中文:Duo方法通过将高斯扩散技术迁移到均匀状态离散扩散模型中,采用课程学习策略加速训练,并引入离散一致性蒸馏实现快速少步生成,在部分基准测试中超越了自回归模型的性能。
English: The Duo method enhances uniform-state discrete diffusion models by transferring techniques from Gaussian diffusion, using curriculum learning to accelerate training and discrete consistency distillation to enable fast few-step generation, outperforming autoregressive models on some benchmarks.

Authors:Zhensheng Jin, Xinze Li, Yifan Ji, Chunyi Peng, Zhenghao Liu, Qi Shi, Yukun Yan, Shuo Wang, Furong Peng, Ge Yu
Title: ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization
Abstract:
Recent advances in Chain-of-Thought (CoT) prompting have substantially improved the reasoning capabilities of Large Language Models (LLMs). However, these methods often suffer from overthinking, leading to unnecessarily lengthy or redundant reasoning traces. Existing approaches attempt to mitigate this issue through curating multiple reasoning chains for training LLMs, but their effectiveness is often constrained by the quality of the generated data and prone to overfitting. To address the challenge, we propose Reasoning Compression ThroUgh Stepwise Trials (ReCUT), a novel method aimed at balancing the accuracy and length of reasoning trajectory. Specifically, ReCUT employs a stepwise exploration mechanism and a long-short switched sampling strategy, enabling LLMs to incrementally generate diverse reasoning paths. These paths are evaluated and used to construct preference pairs to train two specialized models (Gemini LLMs)-one optimized for reasoning accuracy, the other for shorter reasoning. A final integrated model is obtained by interpolating the parameters of these two models. Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%, while maintaining or improving reasoning accuracy compared to various baselines. All codes and data will be released via https://github.com/NEUIR/ReCUT.
Chinese: ReCUT方法通过逐步探索生成多样化推理路径并训练分别优化准确性和简洁性的双模型,在保持或提升推理准确率的同时将推理链长度缩减30-50%。
English: The ReCUT method enhances LLM reasoning by generating diverse stepwise paths and training dual specialized models for accuracy and brevity, achieving 30-50% shorter reasoning chains without compromising accuracy.
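
The final integration step is plain parameter interpolation between the two specialized models. A minimal sketch, assuming an equal-weight mixing coefficient (the paper may choose it differently):

```python
import torch

def interpolate_models(acc_state, short_state, alpha=0.5):
    """Average the parameters of the accuracy-tuned and the
    brevity-tuned model. Inputs are state_dict()-style mappings
    from parameter name to tensor; `alpha` is an assumed weight."""
    return {name: alpha * acc_state[name] + (1 - alpha) * short_state[name]
            for name in acc_state}

acc = {"w": torch.tensor([1.0, 2.0])}
short = {"w": torch.tensor([3.0, 0.0])}
print(interpolate_models(acc, short))  # {'w': tensor([2., 1.])}
```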

Authors:Wei Sun, Tingyu Qu, Mingxiao Li, Jesse Davis, Marie-Francine Moens
Title: Mitigating Negative Interference in Multilingual Sequential Knowledge Editing through Null-Space Constraints
Abstract:
Efficiently updating multilingual knowledge in large language models (LLMs), while preserving consistent factual representations across languages, remains a long-standing and unresolved challenge. While deploying separate editing systems for each language might seem viable, this approach incurs substantial costs due to the need to manage multiple models. A more efficient solution involves integrating knowledge updates across all languages into a unified model. However, performing sequential edits across languages often leads to destructive parameter interference, significantly degrading multilingual generalization and the accuracy of injected knowledge. To address this challenge, we propose LangEdit, a novel null-space constrained framework designed to precisely isolate language-specific knowledge updates. The core innovation of LangEdit lies in its ability to project parameter updates for each language onto the orthogonal complement of previous updated subspaces. This approach mathematically guarantees update independence while preserving multilingual generalization capabilities. We conduct a comprehensive evaluation across three model architectures, six languages, and four downstream tasks, demonstrating that LangEdit effectively mitigates parameter interference and outperforms existing state-of-the-art editing methods. Our results highlight its potential for enabling efficient and accurate multilingual knowledge updates in LLMs. The code is available at https://github.com/VRCMF/LangEdit.git.
中文摘要:LangEdit提出了一种零空间约束框架,通过将参数更新投影至正交子空间来隔离语言特定知识更新,在多种模型架构和任务中有效防止参数干扰并保持多语言泛化能力。
English Summary: LangEdit introduces a null-space constrained framework that isolates language-specific knowledge updates in LLMs by projecting parameter changes onto orthogonal subspaces, effectively preventing interference while maintaining multilingual generalization across diverse models and tasks.
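
The core constraint, projecting each new update onto the orthogonal complement of previously updated subspaces, can be sketched with an SVD. Note this simplified version operates on raw weight updates, whereas the paper defines the subspaces more carefully:

```python
import numpy as np

def project_to_null_space(delta, prev_updates, rank_tol=1e-8):
    """Project a parameter update onto the orthogonal complement of
    the subspace spanned by earlier languages' updates (a simplified
    rendering of the null-space constraint).

    delta:         (d, k) update for the current language
    prev_updates:  list of (d, k) updates from earlier edits
    """
    if not prev_updates:
        return delta
    # Orthonormal basis for the span of all previous updates.
    stacked = np.concatenate(prev_updates, axis=1)     # (d, k*n)
    u, s, _ = np.linalg.svd(stacked, full_matrices=False)
    basis = u[:, s > rank_tol]                         # (d, r)
    # Remove the component of `delta` lying in that subspace.
    return delta - basis @ (basis.T @ delta)

prev = [np.random.randn(8, 2)]
new = np.random.randn(8, 2)
proj = project_to_null_space(new, prev)
print(np.abs(prev[0].T @ proj).max() < 1e-10)  # orthogonal -> True
```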

Authors:Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
Title: TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora
Abstract:
The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus's topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines, as judged by LLMs.
中文: TaxoAdapt是一种新颖框架,通过迭代式层次分类动态调整大语言模型生成的分类体系,使其适应科学文献的多维特性,在保持粒度和连贯性方面显著优于现有方法。
English: TaxoAdapt is a novel framework that dynamically adapts LLM-generated taxonomies to scientific corpora across multiple dimensions, achieving superior granularity and coherence through iterative hierarchical classification.

Authors:Priyanka Kargupta, Runchu Tian, Jiawei Han
Title: Beyond True or False: Retrieval-Augmented Hierarchical Analysis of Nuanced Claims
Abstract:
Claims made by individuals or entities are oftentimes nuanced and cannot be clearly labeled as entirely "true" or "false" -- as is frequently the case with scientific and political claims. However, a claim (e.g., "vaccine A is better than vaccine B") can be dissected into its integral aspects and sub-aspects (e.g., efficacy, safety, distribution), which are individually easier to validate. This enables a more comprehensive, structured response that provides a well-rounded perspective on a given problem while also allowing the reader to prioritize specific angles of interest within the claim (e.g., safety towards children). Thus, we propose ClaimSpect, a retrieval-augmented generation-based framework for automatically constructing a hierarchy of aspects typically considered when addressing a claim and enriching them with corpus-specific perspectives. This structure hierarchically partitions an input corpus to retrieve relevant segments, which assist in discovering new sub-aspects. Moreover, these segments enable the discovery of varying perspectives towards an aspect of the claim (e.g., support, neutral, or oppose) and their respective prevalence (e.g., "how many biomedical papers believe vaccine A is more transportable than B?"). We apply ClaimSpect to a wide variety of real-world scientific and political claims featured in our constructed dataset, showcasing its robustness and accuracy in deconstructing a nuanced claim and representing perspectives within a corpus. Through real-world case studies and human evaluation, we validate its effectiveness over multiple baselines.
中文摘要:ClaimSpect框架通过将复杂主张分解为层级化的方面与子方面,从语料库中检索相关视角来全面分析不同观点,有效处理科学和政治领域中难以简单判断真伪的声明。
English Summary: ClaimSpect is a framework that deconstructs nuanced claims into hierarchical aspects and sub-aspects, enabling comprehensive analysis by retrieving relevant perspectives from a corpus to represent diverse viewpoints accurately.

Authors:Numaan Naeem, Sarfraz Ahmad, Momina Ahsan, Hasan Iqbal
Title: NeuralNexus at BEA 2025 Shared Task: Retrieval-Augmented Prompting for Mistake Identification in AI Tutors
Abstract:
This paper presents our system for Track 1: Mistake Identification in the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors. The task involves evaluating whether a tutor's response correctly identifies a mistake in a student's mathematical reasoning. We explore four approaches: (1) an ensemble of machine learning models over pooled token embeddings from multiple pretrained language models (LMs); (2) a frozen sentence-transformer using [CLS] embeddings with an MLP classifier; (3) a history-aware model with multi-head attention between token-level history and response embeddings; and (4) a retrieval-augmented few-shot prompting system with a large language model (LLM), i.e., GPT-4o. Our final system retrieves semantically similar examples, constructs structured prompts, and uses schema-guided output parsing to produce interpretable predictions. It outperforms all baselines, demonstrating the effectiveness of combining example-driven prompting with LLM reasoning for pedagogical feedback assessment. Our code is available at https://github.com/NaumanNaeem/BEA_2025.
中文: 本文提出了一种结合检索增强提示与大语言模型的系统,用于识别数学推理中的辅导错误,通过结构化提示和可解释预测超越了所有基线方法。
English: This paper introduces a system for identifying tutoring mistakes in mathematical reasoning by combining retrieval-augmented prompting with large language models, which outperforms baseline methods through structured prompts and interpretable predictions.
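
The final system's retrieve-then-prompt step might look like the following sketch, where embed is any text-to-vector function and the JSON schema line is an assumed stand-in for the paper's schema-guided output parsing:

```python
import numpy as np

def build_prompt(query, examples, embed, k=3):
    """Pick the k annotated examples closest to the query in embedding
    space and assemble a few-shot prompt ending in a schema-constrained
    answer request. `embed` is any text -> vector function (an
    assumption; the actual retriever may differ)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    qv = embed(query)
    ranked = sorted(examples, key=lambda e: cos(qv, embed(e["text"])),
                    reverse=True)
    demos = "\n\n".join(f"Exchange: {e['text']}\nVerdict: {e['label']}"
                        for e in ranked[:k])
    return (demos + f"\n\nExchange: {query}\n"
            'Respond as JSON: {"mistake_identified": "Yes" | "No" | "To some extent"}')

# Toy embedding and data just to show the call shape:
toy_embed = lambda t: np.array([len(t), t.count("error"), t.count("?")], float)
examples = [{"text": "Is 2+2=5 an error?", "label": "Yes"},
            {"text": "Great job, that is correct!", "label": "No"}]
print(build_prompt("Does the tutor catch the sign error?", examples, toy_embed, k=1))
```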

Authors:Sergio Burdisso, Esaú Villatoro-Tello, Petr Motlicek
Title: SDialog: A Python Toolkit for Synthetic Dialogue Generation and Analysis
Abstract:
The advancement of conversational AI systems relies on the availability of high-quality, flexible, and reproducible synthetic dialogues for training, evaluation, and benchmarking. SDialog is a modular, extensible Python toolkit designed to address the challenges of synthetic dialogue generation and analysis. By leveraging instruction-tuned Large Language Models (LLMs), SDialog provides abstractions for personas, orchestration, and scenario management, enabling the creation of realistic, diverse, and controllable conversational data for research and development. SDialog supports workflows such as multi-agent simulation and scenario-driven generation, and represents a step forward in the standardization of tools and frameworks for synthetic data generation, a crucial advancement for ensuring reproducibility in today's fast-evolving research landscape.
中文:SDialog是一个模块化的Python工具包,它利用指令调优的大语言模型生成真实可控的合成对话,支持多智能体模拟和场景驱动的工作流程,以推动对话AI研究并确保可复现性。
English: SDialog is a modular Python toolkit that uses instruction-tuned LLMs to generate realistic and controllable synthetic dialogues, supporting multi-agent simulations and scenario-driven workflows to advance conversational AI research and ensure reproducibility.

Authors:Reza Karbasi, Masoud Rahimi, Abdol-Hossein Vahabie, Hadi Moradi
Title: Deep Learning-Based Digitization of Overlapping ECG Images with Open-Source Python Code
Abstract:
This paper addresses the persistent challenge of accurately digitizing paper-based electrocardiogram (ECG) recordings, with a particular focus on robustly handling single leads compromised by signal overlaps, a common yet under-addressed issue in existing methodologies. We propose a two-stage pipeline designed to overcome this limitation. The first stage employs a U-Net based segmentation network, trained on a dataset enriched with overlapping signals and fortified with custom data augmentations, to accurately isolate the primary ECG trace. The subsequent stage converts this refined binary mask into a time-series signal using established digitization techniques, enhanced by an adaptive grid detection module for improved versatility across different ECG formats and scales. Our experimental results demonstrate the efficacy of our approach. The U-Net architecture achieves an IoU of 0.87 for the fine-grained segmentation task. Crucially, our proposed digitization method yields superior performance compared to a well-established baseline technique across both non-overlapping and challenging overlapping ECG samples. For non-overlapping signals, our method achieved a Mean Squared Error (MSE) of 0.0010 and a Pearson Correlation Coefficient (rho) of 0.9644, compared to 0.0015 and 0.9366, respectively, for the baseline. On samples with signal overlap, our method achieved an MSE of 0.0029 and a rho of 0.9641, significantly improving upon the baseline's 0.0178 and 0.8676. This work demonstrates an effective strategy to significantly enhance digitization accuracy, especially in the presence of signal overlaps, thereby laying a strong foundation for the reliable conversion of analog ECG records into analyzable digital data for contemporary research and clinical applications. The implementation is publicly available at this GitHub repository: https://github.com/masoudrahimi39/ECG-code.
中文: 本文提出了一种采用U-Net分割和自适应数字化技术的两阶段流程,能精准将纸质心电图转换为数字信号,相比基线方法在处理信号重叠情况时表现出显著优势。
English: This paper introduces a two-stage pipeline using U-Net segmentation and adaptive digitization to accurately convert paper ECG recordings into digital signals, significantly improving performance especially for overlapping signals compared to baseline methods.
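
The second-stage conversion from binary mask to time series can be approximated by averaging the trace's row indices per column and applying grid calibration. A sketch with assumed calibration constants (the paper's adaptive grid detection supplies these automatically):

```python
import numpy as np

def mask_to_signal(mask, mm_per_px, mv_per_mm=0.1, baseline_row=None):
    """Convert a binary trace mask (rows x cols, True on the ECG trace)
    into a 1-D voltage series by averaging the trace's row index in
    each column. Calibration constants are illustrative; standard ECG
    paper is 10 mm/mV, i.e. 0.1 mV per mm."""
    rows, cols = mask.shape
    if baseline_row is None:
        baseline_row = rows / 2.0
    signal = np.full(cols, np.nan)
    for c in range(cols):
        ys = np.flatnonzero(mask[:, c])
        if ys.size:  # column contains trace pixels
            signal[c] = (baseline_row - ys.mean()) * mm_per_px * mv_per_mm
    # Fill gaps (columns with no detected trace) by linear interpolation.
    idx = np.arange(cols)
    good = ~np.isnan(signal)
    return np.interp(idx, idx[good], signal[good])

mask = np.zeros((10, 5), dtype=bool)
mask[7, 0] = mask[6, 1] = mask[4, 3] = mask[3, 4] = True
print(mask_to_signal(mask, mm_per_px=0.5))
```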

Authors:Xanh Ho, Sunisth Kumar, Yun-Ang Wu, Florian Boudin, Atsuhiro Takasu, Akiko Aizawa
Title: Table-Text Alignment: Explaining Claim Verification Against Tables in Scientific Papers
Abstract:
Scientific claim verification against tables typically requires predicting whether a claim is supported or refuted given a table. However, we argue that predicting the final label alone is insufficient: it reveals little about the model's reasoning and offers limited interpretability. To address this, we reframe table-text alignment as an explanation task, requiring models to identify the table cells essential for claim verification. We build a new dataset by extending the SciTab benchmark with human-annotated cell-level rationales. Annotators verify the claim label and highlight the minimal set of cells needed to support their decision. After the annotation process, we utilize the collected information and propose a taxonomy for handling ambiguous cases. Our experiments show that (i) incorporating table alignment information improves claim verification performance, and (ii) most LLMs, while often predicting correct labels, fail to recover human-aligned rationales, suggesting that their predictions do not stem from faithful reasoning.
Chinese: 本研究将针对表格的科学声明验证重新定义为解释任务,通过构建包含人工标注单元格级依据的数据集,证明尽管融入表格对齐能提升验证性能,但多数大语言模型虽能正确预测却无法产生可信的推理过程。
English: This research reframes scientific claim verification against tables as an explanation task by creating a dataset with human-annotated cell-level rationales, demonstrating that while incorporating table alignment improves verification, most large language models fail to produce faithful reasoning despite correct predictions.

Authors:Guowei Zhong, Ruohong Huan, Mingzhen Wu, Ronghua Liang, Peng Chen
Title: Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts
Abstract:
Recent advancements in Multimodal Emotion Recognition (MER) face challenges in addressing both modality missing and Out-Of-Distribution (OOD) data simultaneously. Existing methods often rely on specific models or introduce excessive parameters, which limits their practicality. To address these issues, we propose a novel robust MER framework, Causal Inference Distiller (CIDer), and introduce a new task, Random Modality Feature Missing (RMFM), to generalize the definition of modality missing. CIDer integrates two key components: a Model-Specific Self-Distillation (MSSD) module and a Model-Agnostic Causal Inference (MACI) module. MSSD enhances robustness under the RMFM task through a weight-sharing self-distillation approach applied across low-level features, attention maps, and high-level representations. Additionally, a Word-level Self-aligned Attention Module (WSAM) reduces computational complexity, while a Multimodal Composite Transformer (MCT) facilitates efficient multimodal fusion. To tackle OOD challenges, MACI employs a tailored causal graph to mitigate label and language biases using a Multimodal Causal Module (MCM) and fine-grained counterfactual texts. Notably, MACI can independently enhance OOD generalization with minimal additional parameters. Furthermore, we also introduce the new repartitioned MER OOD datasets. Experimental results demonstrate that CIDer achieves robust performance in both RMFM and OOD scenarios, with fewer parameters and faster training compared to state-of-the-art methods. The implementation of this work is publicly accessible at https://github.com/gw-zhong/CIDer.
中文摘要:提出的因果推理蒸馏器(CIDer)框架通过自蒸馏和因果推理模块,解决了多模态情感识别中模态缺失和分布外数据的双重挑战,以更少参数和更快训练实现了鲁棒性能。
English Summary: The proposed Causal Inference Distiller (CIDer) framework addresses simultaneous modality missing and Out-Of-Distribution challenges in Multimodal Emotion Recognition through self-distillation and causal inference modules, achieving robust performance with fewer parameters and faster training.

Authors:Tony Alex, Wish Suharitdamrong, Sara Atito, Armin Mustafa, Philip J. B. Jackson, Imran Razzak, Muhammad Awais
Title: PAL: Probing Audio Encoders via LLMs -- A Study of Information Transfer from Audio Encoders to LLMs
Abstract:
The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. Although application-focused developments, particularly in curating training data for specific capabilities e.g., audio reasoning, have progressed rapidly, the underlying mechanisms that govern efficient transfer of rich semantic representations from audio encoders to LLMs remain under-explored. We conceptualize effective audio-LLM interaction as the LLM's ability to proficiently probe the audio encoder representations to satisfy textual queries. This paper presents a systematic investigation on how architectural design choices can affect that. Beginning with a standard Pengi/LLaVA-style audio-LLM architecture, we propose and evaluate several modifications guided by hypotheses derived from mechanistic interpretability studies and LLM operational principles. Our experiments demonstrate that: (1) delaying audio integration until the LLM's initial layers establish textual context that enhances its ability to probe the audio representations for relevant information; (2) the LLM can proficiently probe audio representations exclusively through LLM layer's attention submodule, without requiring propagation to its Feed-Forward Network (FFN) submodule; (3) an efficiently integrated ensemble of diverse audio encoders provides richer, complementary representations, thereby broadening the LLM's capacity to probe a wider spectrum of audio information. All hypotheses are evaluated using an identical three-stage training curriculum on a dataset of 5.6 million audio-text pairs, ensuring controlled comparisons. Our final architecture, which incorporates all proposed modifications, achieves relative improvements from 10\% to 60\% over the baseline, validating our approach to optimizing cross-modal information transfer in audio-LLMs. Project page: https://ta012.github.io/PAL/
中文: 本研究系统探索了增强音频-大语言模型交互的架构改进,证明延迟音频集成、仅注意力探测和多编码器集成能显著提升跨模态信息传递与模型性能。
English: This study systematically explores architectural modifications to enhance audio-LLM interactions, demonstrating that delayed audio integration, attention-only probing, and diverse encoder ensembles significantly improve cross-modal information transfer and performance.

Authors:Yingjin Song, Yupei Du, Denis Paperno, Albert Gatt
Title: Burn After Reading: Do Multimodal Large Language Models Truly Capture Order of Events in Image Sequences?
Abstract:
This paper introduces the TempVS benchmark, which focuses on temporal grounding and reasoning capabilities of Multimodal Large Language Models (MLLMs) in image sequences. TempVS consists of three main tests (i.e., event relation inference, sentence ordering and image ordering), each accompanied with a basic grounding test. TempVS requires MLLMs to rely on both visual and linguistic modalities to understand the temporal order of events. We evaluate 38 state-of-the-art MLLMs, demonstrating that models struggle to solve TempVS, with a substantial performance gap compared to human capabilities. We also provide fine-grained insights that suggest promising directions for future research. Our TempVS benchmark data and code are available at https://github.com/yjsong22/TempVS.
中文: TempVS基准测试评估多模态大语言模型在图像序列中的时序推理能力,揭示了与人类能力间的显著差距,并为未来研究提供了方向性见解。
English: The TempVS benchmark evaluates Multimodal Large Language Models' temporal reasoning in image sequences, revealing significant performance gaps compared to humans while offering insights for future research.

Authors:Xiaohan Yu, Pu Jian, Chong Chen
Title: TableRAG: A Retrieval Augmented Generation Framework for Heterogeneous Document Reasoning
Abstract:
Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents, comprising both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and chunking strategies disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs in multi-hop, global queries. To address these challenges, we propose TableRAG, a hybrid framework that unifies textual understanding and complex manipulations over tabular data. TableRAG iteratively operates in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
中文: TableRAG提出了一种基于SQL的框架,通过保留表格结构和实现复杂推理,有效解决了现有RAG方法在处理异构文档时的局限性,在公共数据集和新开发的HeteQA基准测试中均达到了最优性能。
English: TableRAG introduces an SQL-based framework that overcomes the limitations of existing RAG methods in handling heterogeneous documents by preserving tabular structures and enabling complex reasoning, achieving state-of-the-art performance on both public datasets and the newly developed HeteQA benchmark.
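
The four-step loop can be skeletonized as follows; the function boundaries (decompose, retrieve, llm) and the routing field are assumptions about the interface, not the released code:

```python
import sqlite3

def table_rag(question, decompose, retrieve, llm, db_path):
    """Skeleton of an iterative heterogeneous-QA loop in the style the
    abstract describes (interfaces are assumed): decompose the
    question, answer each sub-query by text retrieval or by generating
    and executing SQL over intact tables, then compose the notes."""
    conn = sqlite3.connect(db_path)
    notes = []
    for sub in decompose(question):                    # query decomposition
        if sub["kind"] == "text":
            passages = retrieve(sub["query"])          # text retrieval
            notes.append(llm(f"Answer from passages:\n{passages}\nQ: {sub['query']}"))
        else:                                          # SQL programming + execution
            sql = llm(f"Write SQLite for: {sub['query']}")
            notes.append(str(conn.execute(sql).fetchall()))
    return llm("Combine into a final answer:\n" + "\n".join(notes))

# Dry run with stubs standing in for the real components:
stub_llm = lambda p: "SELECT 1" if "SQLite" in p else "stub answer"
stub_decompose = lambda q: [{"kind": "sql", "query": q}]
print(table_rag("How many rows?", stub_decompose, lambda q: "", stub_llm, ":memory:"))
```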

Authors:Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli
Title: Discrete Audio Tokens: More Than a Survey!
Abstract:
Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern large language models (LLMs). As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder, quantization techniques, training paradigm, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.
Chinese: 本文系统性地回顾和评估了离散音频标记器在语音、音乐和通用音频领域的表现,通过重构、下游任务和声学语言建模的基准测试,揭示了当前方法的局限性并为未来研究提供了指导方向。
English: This paper provides a systematic review and benchmark of discrete audio tokenizers across speech, music, and general audio domains, evaluating their performance on reconstruction, downstream tasks, and acoustic language modeling while highlighting limitations and future research directions.

Authors:Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing Wu, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
Title: Omni-DPO: A Dual-Perspective Paradigm for Dynamic Preference Learning of LLMs
Abstract:
Direct Preference Optimization (DPO) has become a cornerstone of reinforcement learning from human feedback (RLHF) due to its simplicity and efficiency. However, existing DPO-based approaches typically treat all preference pairs uniformly, ignoring critical variations in their inherent quality and learning utility, leading to suboptimal data utilization and performance. To address this challenge, we propose Omni-DPO, a dual-perspective optimization framework that jointly accounts for (1) the inherent quality of each preference pair and (2) the model's evolving performance on those pairs. By adaptively weighting samples according to both data quality and the model's learning dynamics during training, Omni-DPO enables more effective training data utilization and achieves better performance. Experimental results on various models and benchmarks demonstrate the superiority and generalization capabilities of Omni-DPO. On textual understanding tasks, Gemma-2-9b-it finetuned with Omni-DPO beats the leading LLM, Claude 3 Opus, by a significant margin of 6.7 points on the Arena-Hard benchmark. On mathematical reasoning tasks, Omni-DPO consistently outperforms the baseline methods across all benchmarks, providing strong empirical evidence for the effectiveness and robustness of our approach. Code and models will be available at https://github.com/pspdada/Omni-DPO.
中文摘要:Omni-DPO是一种新颖的双视角优化框架,通过基于数据固有质量和模型学习动态的自适应加权机制改进直接偏好优化方法,在多个基准测试中实现了卓越性能。
English Summary: Omni-DPO is a novel dual-perspective optimization framework that enhances Direct Preference Optimization by adaptively weighting preference pairs based on both inherent data quality and the model's learning dynamics, achieving superior performance across various benchmarks.
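
One way to picture the dual-perspective weighting is a standard DPO term scaled by a per-pair quality weight and a focal-style factor driven by the model's current margin; the focal form is an assumption about how learning dynamics enter, not the paper's exact rule:

```python
import torch
import torch.nn.functional as F

def dual_weighted_dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l,
                           quality, beta=0.1, gamma=2.0):
    """DPO objective scaled by (a) an external per-pair quality weight
    and (b) a focal-style factor that up-weights pairs the model
    currently gets wrong. The focal form and gamma are assumptions."""
    margin = beta * ((pi_logps_w - ref_logps_w) - (pi_logps_l - ref_logps_l))
    p_correct = torch.sigmoid(margin)          # prob. model prefers the winner
    dynamics = (1 - p_correct).detach() ** gamma
    return -(quality * dynamics * F.logsigmoid(margin)).mean()

# Per-pair sequence log-probs under the policy and reference model:
lw = torch.tensor([-4.0, -6.0]); ll = torch.tensor([-5.0, -5.5])
rw = torch.tensor([-4.5, -6.0]); rl = torch.tensor([-4.8, -5.2])
print(dual_weighted_dpo_loss(lw, ll, rw, rl, quality=torch.tensor([1.0, 0.5])))
```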

Authors:Javad Rajabi, Soroush Mehraban, Seyedmorteza Sadat, Babak Taati
Title: Token Perturbation Guidance for Diffusion Models
Abstract:
Classifier-free guidance (CFG) has become an essential component of modern diffusion models to enhance both generation quality and alignment with input conditions. However, CFG requires specific training procedures and is limited to conditional generation. To address these limitations, we propose Token Perturbation Guidance (TPG), a novel method that applies perturbation matrices directly to intermediate token representations within the diffusion network. TPG employs a norm-preserving shuffling operation to provide effective and stable guidance signals that improve generation quality without architectural changes. As a result, TPG is training-free and agnostic to input conditions, making it readily applicable to both conditional and unconditional generation. We further analyze the guidance term provided by TPG and show that its effect on sampling more closely resembles CFG compared to existing training-free guidance techniques. Extensive experiments on SDXL and Stable Diffusion 2.1 show that TPG achieves nearly a 2$\times$ improvement in FID for unconditional generation over the SDXL baseline, while closely matching CFG in prompt alignment. These results establish TPG as a general, condition-agnostic guidance method that brings CFG-like benefits to a broader class of diffusion models. The code is available at https://github.com/TaatiTeam/Token-Perturbation-Guidance
中文: 无分类器引导(CFG)虽能提升扩散模型的生成质量和对齐效果,但需特定训练且仅适用于条件生成;而提出的令牌扰动引导(TPG)通过直接扰动中间令牌表示,实现了无需训练、条件无关的通用引导方法,在无条件生成中显著优化FID指标并保持与提示的高度匹配。
English: Classifier-free guidance (CFG) improves diffusion models but requires specific training and is limited to conditional generation, whereas the proposed Token Perturbation Guidance (TPG) offers a training-free, condition-agnostic approach that enhances generation quality and alignment by perturbing token representations, achieving significant improvements in FID and prompt adherence.
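
The perturbation itself is easy to state: shuffle intermediate tokens (norm-preserving by construction, since only their arrangement changes) and extrapolate away from the perturbed branch, CFG-style. A sketch in which the guidance scale and where the combination is applied are assumptions:

```python
import torch

def shuffle_tokens(h, generator=None):
    """Norm-preserving perturbation: permute tokens along the sequence
    axis. The set of token vectors (hence their norms) is unchanged;
    only their arrangement is destroyed.  h: (batch, tokens, dim)"""
    perm = torch.randperm(h.size(1), generator=generator)
    return h[:, perm, :]

def guided_output(eps_clean, eps_perturbed, scale=3.0):
    """CFG-style extrapolation away from the shuffled branch;
    the scale and schedule follow the paper in the real method."""
    return eps_perturbed + scale * (eps_clean - eps_perturbed)

h = torch.randn(1, 6, 4)
print(torch.allclose(h.norm(), shuffle_tokens(h).norm()))  # True
```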

Authors:Benjamin Reichman, Constantin Patsch, Jack Truxal, Atishay Jain, Larry Heck
Title: Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
Abstract:
In outside knowledge visual question answering (OK-VQA), the model must identify relevant visual information within an image and incorporate external knowledge to accurately respond to a question. Extending this task to a visually grounded dialogue setting based on videos, a conversational model must both recognize pertinent visual details over time and answer questions where the required information is not necessarily present in the visual information. Moreover, the context of the overall conversation must be considered for the subsequent dialogue. To explore this task, we introduce a dataset comprised of $2,017$ videos with $5,986$ human-annotated dialogues consisting of $40,954$ interleaved dialogue turns. While the dialogue context is visually grounded in specific video segments, the questions further require external knowledge that is not visually present. Thus, the model not only has to identify relevant video parts but also leverage external knowledge to converse within the dialogue. We further provide several baselines evaluated on our dataset and show future challenges associated with this task. The dataset is made publicly available here: https://github.com/c-patsch/OKCV.
中文摘要:本研究提出了一个视觉对话数据集,要求模型识别相关视频片段并利用外部知识回答视觉信息中未包含的问题,同时通过基线评估揭示了该任务未来的挑战。
English Summary: This study introduces a dataset for visually grounded dialogue tasks requiring models to identify relevant video segments and incorporate external knowledge to answer questions not present in the visual content, with baseline evaluations highlighting future challenges.

Authors:Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Title: VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
Abstract:
Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at https://github.com/THU-KEG/VerIF.
中文: 本文提出了VerIF验证方法,结合基于规则和大型语言模型的验证技术,显著提升了指令跟随的强化学习效果,在保持通用能力的同时实现了最优性能与良好泛化能力。
English: This paper introduces VerIF, a verification method combining rule-based and LLM-based approaches to enhance reinforcement learning for instruction following, achieving state-of-the-art performance and generalizability without compromising general capabilities.
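
A plausible shape for the hybrid verifier: hard, checkable constraints verified in code, soft constraints delegated to an LLM judge, with a binary reward. The constraint taxonomy below is illustrative, not VerInstruct's actual schema:

```python
import re

def verif_reward(instruction, response, constraints, llm_judge):
    """Hybrid verification sketch (structure assumed): rule-based code
    checks for hard constraints, an LLM judge for soft ones. The
    reward is 1.0 only if every constraint passes."""
    checks = []
    for c in constraints:
        if c["type"] == "max_words":
            checks.append(len(response.split()) <= c["value"])
        elif c["type"] == "must_include":
            checks.append(c["value"].lower() in response.lower())
        elif c["type"] == "regex":
            checks.append(re.search(c["value"], response) is not None)
        else:  # soft constraint -> LLM-based verification
            verdict = llm_judge(
                f"Instruction: {instruction}\nResponse: {response}\n"
                f"Does the response satisfy: {c['value']}? Answer YES or NO.")
            checks.append(verdict.strip().upper().startswith("YES"))
    return float(all(checks))

judge = lambda p: "YES"  # stub; real use would call a reasoning model
cons = [{"type": "max_words", "value": 50},
        {"type": "must_include", "value": "Paris"},
        {"type": "style", "value": "formal tone"}]
print(verif_reward("Name the capital of France formally.",
                   "The capital of France is Paris.", cons, judge))  # 1.0
```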

Authors:Chengpeng Li, Zhengyang Tang, Ziniu Li, Mingfeng Xue, Keqin Bao, Tian Ding, Ruoyu Sun, Benyou Wang, Xiang Wang, Junyang Lin, Dayiheng Liu
Title: CoRT: Code-integrated Reasoning within Thinking
Abstract:
Large Reasoning Models (LRMs) like o1 and DeepSeek-R1 have shown remarkable progress in natural language reasoning with long chain-of-thought (CoT), yet they remain inefficient or inaccurate when handling complex mathematical operations. Addressing these limitations through computational tools (e.g., computation libraries and symbolic solvers) is promising, but it introduces a technical challenge: Code Interpreter (CI) brings external knowledge beyond the model's internal text representations, thus the direct combination is not efficient. This paper introduces CoRT, a post-training framework for teaching LRMs to leverage CI effectively and efficiently. As a first step, we address the data scarcity issue by synthesizing code-integrated reasoning data through Hint-Engineering, which strategically inserts different hints at appropriate positions to optimize LRM-CI interaction. We manually create 30 high-quality samples, upon which we post-train models ranging from 1.5B to 32B parameters, with supervised fine-tuning, rejection fine-tuning and reinforcement learning. Our experimental results demonstrate that Hint-Engineering models achieve 4\% and 8\% absolute improvements on DeepSeek-R1-Distill-Qwen-32B and DeepSeek-R1-Distill-Qwen-1.5B respectively, across five challenging mathematical reasoning datasets. Furthermore, Hint-Engineering models use about 30\% fewer tokens for the 32B model and 50\% fewer tokens for the 1.5B model compared with the natural language models. The models and code are available at https://github.com/ChengpengLi1003/CoRT.
中文:CoRT是一种后训练框架,通过提示工程集成代码解释器,有效提升大型推理模型在数学推理中的性能,显著减少计算资源消耗并提高准确率。
English: CoRT is a post-training framework that enhances Large Reasoning Models' efficiency in mathematical reasoning by integrating Code Interpreters through Hint-Engineering, achieving significant performance improvements and token reduction.

Authors:Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: ComfyUI-R1: Exploring Reasoning Models for Workflow Generation
Abstract:
AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97\% format validity rate, along with high pass rate, node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
中文:ComfyUI-R1是首个通过两阶段训练框架实现自动化工作流生成的大型推理模型,在格式有效性和结构准确性方面显著超越了当前最先进的闭源模型。
English: ComfyUI-R1 is the first large reasoning model that automates AI workflow generation through a two-stage training framework, achieving superior performance in format validity and structural accuracy compared to leading closed-source models.

Authors:Haoyi Song, Ruihan Ji, Naichen Shi, Fan Lai, Raed Al Kontar
Title: Inv-Entropy: A Fully Probabilistic Framework for Uncertainty Quantification in Language Models
Abstract:
Large language models (LLMs) have transformed natural language processing, but their reliable deployment requires effective uncertainty quantification (UQ). Existing UQ methods are often heuristic and lack a probabilistic foundation. This paper begins by providing a theoretical justification for the role of perturbations in UQ for LLMs. We then introduce a dual random walk perspective, modeling input-output pairs as two Markov chains with transition probabilities defined by semantic similarity. Building on this, we propose a fully probabilistic framework based on an inverse model, which quantifies uncertainty by evaluating the diversity of the input space conditioned on a given output through systematic perturbations. Within this framework, we define a new uncertainty measure, Inv-Entropy. A key strength of our framework is its flexibility: it supports various definitions of uncertainty measures, embeddings, perturbation strategies, and similarity metrics. We also propose GAAP, a perturbation algorithm based on genetic algorithms, which enhances the diversity of sampled inputs. In addition, we introduce a new evaluation metric, Temperature Sensitivity of Uncertainty (TSU), which directly assesses uncertainty without relying on correctness as a proxy. Extensive experiments demonstrate that Inv-Entropy outperforms existing semantic UQ methods. The code to reproduce the results can be found at https://github.com/UMDataScienceLab/Uncertainty-Quantification-for-LLMs.
中文摘要:本文通过将输入-输出对建模为双马尔可夫链,提出了基于逆模型的概率框架和Inv-熵不确定性度量方法,实验证明其在语义不确定性量化方面优于现有方法。
English Summary: This paper introduces a probabilistic framework for uncertainty quantification in large language models by modeling input-output pairs as dual Markov chains and proposing Inv-Entropy as a novel uncertainty measure, with experiments showing its superiority over existing methods.

Authors:Tianjun Yao, Haoxuan Li, Zhiqiang Shen, Pan Li, Tongliang Liu, Kun Zhang
Title: Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering
Abstract:
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains, but their reliability is hindered by the outdated knowledge and hallucinations. Retrieval-Augmented Generation mitigates these issues by grounding LLMs with external knowledge; however, most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Knowledge graphs, which represent facts as relational triples, offer a more structured and compact alternative. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering (KGQA), with a significant proportion adopting the retrieve-then-reasoning paradigm. In this framework, graph-based retrievers have demonstrated strong empirical performance, yet they still face challenges in generalization ability. In this work, we propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA. RAPL addresses these limitations through three aspects: (1) a two-stage labeling strategy that combines heuristic signals with parametric models to provide causally grounded supervision; (2) a model-agnostic graph transformation approach to capture both intra- and inter-triple interactions, thereby enhancing representational capacity; and (3) a path-based reasoning strategy that facilitates learning from the injected rational knowledge, and supports downstream reasoner through structured inputs. Empirically, RAPL outperforms state-of-the-art methods by $2.66\%-20.34\%$, and significantly reduces the performance gap between smaller and more powerful LLM-based reasoners, as well as the gap under cross-dataset settings, highlighting its superior retrieval capability and generalizability. Codes are available at: https://github.com/tianyao-aka/RAPL.
中文: 大语言模型因知识过时和幻觉问题影响可靠性,而RAPL框架通过创新的图检索方法增强结构化推理和泛化能力,有效提升了知识图谱问答的性能。
English: Large Language Models face reliability issues due to outdated knowledge and hallucinations, which the proposed RAPL framework addresses through a novel graph retrieval approach that enhances structured reasoning and generalizability in knowledge graph question answering.

Authors:Beomsik Cho, Jaehyung Kim
Title: Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs
Abstract:
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across various multimodal tasks by integrating visual perception with language understanding. However, conventional decoding strategies of LVLMs often fail to successfully utilize visual information, leading to visually ungrounded responses. While various approaches have been proposed to address this limitation, they typically require additional training, multi-step inference procedures, or external model dependencies. This paper introduces ReVisiT, a simple yet effective decoding method that references vision tokens to guide the text generation process in LVLMs. Our approach leverages the semantic information embedded within vision tokens by projecting them into the text token distribution space, and dynamically selecting the most relevant vision token at each decoding step through constrained divergence minimization. This selected vision token is then used to refine the output distribution to better incorporate visual semantics. Experiments on three LVLM hallucination benchmarks with two recent LVLMs demonstrate that ReVisiT consistently enhances visual grounding with minimal computational overhead. Moreover, our method achieves competitive or superior results relative to state-of-the-art baselines while reducing computational costs by up to $2\times$.
中文: ReVisiT是一种新颖的解码方法,通过在文本生成过程中动态参考视觉标记来增强大型视觉语言模型的视觉基础能力,无需额外训练即可显著减少幻觉并降低计算成本。
English: ReVisiT is a novel decoding method that enhances visual grounding in Large Vision-Language Models by dynamically referencing vision tokens during text generation, significantly reducing hallucinations and computational costs without requiring additional training.
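The core decoding step is compact enough to sketch. In the PyTorch fragment below (a schematic reading of the abstract, with the blending weight `alpha` as an assumption), vision-token hidden states are projected through the language-model head, the vision token whose induced vocabulary distribution is closest in KL to the current text distribution is selected, and the logits are nudged toward it.

```python
import torch
import torch.nn.functional as F

def revisit_step(text_logits, vision_hidden, lm_head, alpha=0.5):
    """text_logits: (V,); vision_hidden: (k, d) vision-token states; lm_head: Linear d->V."""
    log_p_text = F.log_softmax(text_logits, dim=-1)
    vision_logits = lm_head(vision_hidden)            # project vision tokens into text space
    log_p_vis = F.log_softmax(vision_logits, dim=-1)  # (k, V)
    # KL(p_vis || p_text) for each vision token; pick the most text-consistent one.
    kl = (log_p_vis.exp() * (log_p_vis - log_p_text)).sum(dim=-1)
    best = kl.argmin()
    # Refine the output distribution toward the selected vision token.
    return (1 - alpha) * text_logits + alpha * vision_logits[best]
```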

Authors:Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Deli Zhao, Wenbing Huang, Tingyang Xu, Qifeng Bai, Yu Rong
Title: ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Abstract:
Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun-Work/ReasonMed.
中文: ReasonMed作为迄今最大的医学推理数据集,通过多智能体生成与验证流程构建了37万高质量样本,其训练的模型在医学问答中表现卓越,如ReasonMed-7B在PubMedQA上较优模型提升超4.6%。
English: ReasonMed, the largest medical reasoning dataset with 370k high-quality examples, enhances LLM training by integrating detailed reasoning with concise summaries, enabling models like ReasonMed-7B to surpass previous benchmarks by over 4% on medical QA tasks.

Authors:Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
Title: Give Me FP32 or Give Me Death? Challenges and Solutions for Reproducible Reasoning
Abstract:
Large Language Models (LLMs) are now integral across various domains and have demonstrated impressive performance. Progress, however, rests on the premise that benchmark scores are both accurate and reproducible. We demonstrate that the reproducibility of LLM performance is fragile: changing the system configuration, such as evaluation batch size, GPU count, and GPU version, can introduce significant differences in the generated responses. This issue is especially pronounced in reasoning models, where minor rounding differences in early tokens can cascade into divergent chains of thought, ultimately affecting accuracy. For instance, under bfloat16 precision with greedy decoding, a reasoning model like DeepSeek-R1-Distill-Qwen-7B can exhibit up to 9% variation in accuracy and a 9,000-token difference in response length due to differences in GPU count, type, and evaluation batch size. We trace the root cause of this variability to the non-associative nature of floating-point arithmetic under limited numerical precision. This work presents the first systematic investigation into how numerical precision affects reproducibility in LLM inference. Through carefully controlled experiments across various hardware, software, and precision settings, we quantify when and how model outputs diverge. Our analysis reveals that floating-point precision -- while critical for reproducibility -- is often neglected in evaluation practices. Inspired by this, we develop a lightweight inference pipeline, dubbed LayerCast, that stores weights in 16-bit precision but performs all computations in FP32, balancing memory efficiency with numerical stability. Code is available at https://github.com/nanomaoli/llm_reproducibility.
中文: 大语言模型在不同硬件配置下因浮点运算的非结合性导致结果可复现性脆弱,推理模型准确率波动高达9%,为此开发了LayerCast轻量推理框架以平衡计算稳定性与内存效率。
English: Large Language Models exhibit fragile reproducibility due to floating-point arithmetic variations under different hardware configurations, with reasoning models showing up to 9% accuracy fluctuations, prompting the development of LayerCast for stable inference.
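The LayerCast idea is simple to mimic in PyTorch. The sketch below is a conceptual stand-in for the released pipeline, not its actual code: weights live in bfloat16, and each forward upcasts just in time so the matmul accumulates in FP32.

```python
import torch

class LayerCastLinear(torch.nn.Module):
    """Inference-only linear layer: bfloat16 storage, float32 compute."""
    def __init__(self, linear: torch.nn.Linear):
        super().__init__()
        self.weight = torch.nn.Parameter(linear.weight.to(torch.bfloat16), requires_grad=False)
        self.bias = None if linear.bias is None else torch.nn.Parameter(
            linear.bias.to(torch.bfloat16), requires_grad=False)

    def forward(self, x):
        w = self.weight.float()   # just-in-time upcast; memory stays ~16-bit at rest
        b = None if self.bias is None else self.bias.float()
        return torch.nn.functional.linear(x.float(), w, b)

# Usage sketch: swap every nn.Linear in a loaded model for LayerCastLinear(linear).
```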

Authors:Prameshwar Thiyagarajan, Vaishnavi Parimi, Shamant Sai, Soumil Garg, Zhangir Meirbek, Nitin Yarlagadda, Kevin Zhu, Chris Kim
Title: UniToMBench: Integrating Perspective-Taking to Improve Theory of Mind in LLMs
Abstract:
Theory of Mind (ToM), the ability to understand the mental states of oneself and others, remains a challenging area for large language models (LLMs), which often fail to predict human mental states accurately. In this paper, we introduce UniToMBench, a unified benchmark that integrates the strengths of SimToM and TOMBENCH to systematically improve and assess ToM capabilities in LLMs through multi-interaction task designs and evolving story scenarios. Supported by a custom dataset of over 1,000 hand-written scenarios, UniToMBench combines perspective-taking techniques with diverse evaluation metrics to better stimulate social cognition in LLMs. Through evaluation, we observe that while models like GPT-4o and GPT-4o Mini show consistently high accuracy in tasks involving emotional and belief-related scenarios, with results usually above 80%, there is significant variability in their performance across knowledge-based tasks. These results highlight both the strengths and limitations of current LLMs in ToM-related tasks, underscoring the value of UniToMBench as a comprehensive tool for future development. Our code is publicly available here: https://github.com/Shamant/unifiedtombenchmark.
Chinese: UniToMBench作为一个统一基准,通过整合多交互任务和动态情景来提升和评估大语言模型的心理理论能力,结果表明尽管GPT-4o等模型在情感和信念任务中表现出色,但在知识型任务中的表现存在显著差异。
English: UniToMBench is a unified benchmark designed to enhance and evaluate Theory of Mind capabilities in large language models by integrating multi-interaction tasks and evolving scenarios, revealing that while models like GPT-4o excel in emotional and belief-based tasks, their performance varies significantly in knowledge-based contexts.

Authors:Siheng Li, Zhanhui Zhou, Wai Lam, Chao Yang, Chaochao Lu
Title: RePO: Replay-Enhanced Policy Optimization
Abstract:
Reinforcement learning (RL) is vital for optimizing large language models (LLMs). Recent Group Relative Policy Optimization (GRPO) estimates advantages using multiple on-policy outputs per prompt, leading to high computational costs and low data efficiency. To address this, we introduce Replay-Enhanced Policy Optimization (RePO), which leverages diverse replay strategies to retrieve off-policy samples from a replay buffer, allowing policy optimization based on a broader and more diverse set of samples for each prompt. Experiments on five LLMs across seven mathematical reasoning benchmarks demonstrate that RePO achieves absolute average performance gains of $18.4$ and $4.1$ points for Qwen2.5-Math-1.5B and Qwen3-1.7B, respectively, compared to GRPO. Further analysis indicates that RePO increases computational cost by $15\%$ while raising the number of effective optimization steps by $48\%$ for Qwen3-1.7B, with both on-policy and off-policy sample numbers set to $8$. The repository can be accessed at https://github.com/SihengLi99/RePO.
中文: RePO通过回放策略利用离策略样本进行更高效的策略优化,在适度增加计算成本的同时,相比GRPO实现了显著的性能提升。
English: RePO introduces replay strategies to utilize off-policy samples for more efficient policy optimization in LLMs, achieving significant performance gains over GRPO with a moderate increase in computational cost.
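Mechanically, RePO amounts to augmenting each prompt's on-policy group with samples drawn from a per-prompt buffer before advantages are computed. A schematic sketch (the paper compares several replay strategies; only uniform sampling is shown, and the callable names are hypothetical):

```python
import random
from collections import defaultdict

class ReplayBuffer:
    """Keeps recent (response, reward) pairs per prompt for off-policy reuse."""
    def __init__(self, capacity=64):
        self.data = defaultdict(list)
        self.capacity = capacity

    def add(self, prompt, response, reward):
        bucket = self.data[prompt]
        bucket.append((response, reward))
        del bucket[:-self.capacity]          # FIFO trim to capacity

    def sample(self, prompt, k):
        bucket = self.data[prompt]
        return random.sample(bucket, min(k, len(bucket)))

# Per step: group = rollout(policy, prompt, n=8) + buffer.sample(prompt, k=8),
# then compute group-relative advantages over the combined set.
```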

Authors:Val Andrei Fajardo, David B. Emerson, Amandeep Singh, Veronica Chatrath, Marcelo Lotif, Ravi Theja, Alex Cheung, Izuki Matsuba
Title: FedRAG: A Framework for Fine-Tuning Retrieval-Augmented Generation Systems
Abstract:
Retrieval-augmented generation (RAG) systems have been shown to be effective in addressing many of the drawbacks of relying solely on the parametric memory of large language models. Recent work has demonstrated that RAG systems can be improved via fine-tuning of their retriever and generator models. In this work, we introduce FedRAG, a framework for fine-tuning RAG systems across centralized and federated architectures. FedRAG supports state-of-the-art fine-tuning methods, offering a simple and intuitive interface and a seamless conversion from centralized to federated training tasks. FedRAG is also deeply integrated with the modern RAG ecosystem, filling a critical gap in available tools.
中文: FedRAG是一个框架,支持在集中式和联邦式架构下对检索增强生成系统进行微调,集成了先进方法与现代RAG工具,填补了现有工具的空白。
English: FedRAG is a framework that enables fine-tuning of retrieval-augmented generation systems across both centralized and federated architectures, integrating advanced methods and modern RAG tools to bridge existing gaps.

Authors:Nadezhda Chirkova, Tunde Oluwaseyi Ajayi, Seth Aycock, Zain Muhammad Mujahid, Vladana Perlić, Ekaterina Borisova, Markarit Vartampetian
Title: LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Abstract:
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e. with numerical scores as main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach with the main output being a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights on what improvements can be done to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that LLM-as-a-qualitative-judge correctly recognizes instance-specific issues in 2/3 cases and is capable of producing error type reports resembling the reports composed by human annotators. Our code and data are publicly available at https://github.com/tunde-ajayi/llm-as-a-qualitative-judge.
中文: 本文提出LLM作为定性评判者的方法,通过生成自然语言生成系统中常见问题的结构化报告为开发者提供改进见解,并在多个数据集评估中验证了其识别具体问题和生成类人工报告的能力。
English: This paper introduces LLM-as-a-qualitative-judge, an approach that generates structured reports of common issues in natural language generation systems to provide developers with actionable insights, demonstrating its effectiveness through evaluations on multiple datasets.
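The cumulative clustering step can be approximated with a greedy one-pass scheme over issue embeddings; a sketch, with the cosine threshold as an assumption:

```python
import numpy as np

def cumulative_cluster(issue_embs, threshold=0.75):
    """Assign each issue to the most similar existing cluster, else open a new one."""
    reps, assignments = [], []            # reps[i] = embedding of cluster i's first member
    for e in issue_embs:
        e = e / np.linalg.norm(e)
        sims = [r @ e for r in reps]
        if sims and max(sims) >= threshold:
            assignments.append(int(np.argmax(sims)))
        else:
            reps.append(e)
            assignments.append(len(reps) - 1)
    return assignments                    # cluster sizes become the error-type report
```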

Authors:Zheqi He, Yesheng Liu, Jing-shu Zheng, Xuejing Li, Jin-Ge Yao, Bowen Qin, Richeng Xuan, Xi Yang
Title: FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
Abstract:
We present FlagEvalMM, an open-source evaluation framework designed to comprehensively assess multimodal models across a diverse range of vision-language understanding and generation tasks, such as visual question answering, text-to-image/video generation, and image-text retrieval. We decouple model inference from evaluation through an independent evaluation service, thus enabling flexible resource allocation and seamless integration of new tasks and models. Moreover, FlagEvalMM utilizes advanced inference acceleration tools (e.g., vLLM, SGLang) and asynchronous data loading to significantly enhance evaluation efficiency. Extensive experiments show that FlagEvalMM offers accurate and efficient insights into model strengths and limitations, making it a valuable tool for advancing multimodal research. The framework is publicly accessible at https://github.com/flageval-baai/FlagEvalMM.
中文: FlagEvalMM是一个开源框架,通过将模型推理与评估解耦并采用加速工具,能高效评估多模态模型在各种视觉语言任务上的表现,为研究提供准确性能分析。
English: FlagEvalMM is an open-source framework that efficiently evaluates multimodal models across diverse vision-language tasks by decoupling inference from evaluation and utilizing acceleration tools for enhanced performance insights.

Authors:Dianyi Wang, Wei Song, Yikun Wang, Siyuan Wang, Kaicheng Yu, Zhongyu Wei, Jiaqi Wang
Title: Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
Abstract:
Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.
中文: 现有的大规模视觉语言模型常因仅关注文本而忽略细粒度视觉信息,但提出的自回归语义视觉重建(ASVR)方法通过联合学习视觉与文本模态,显著提升了多模态理解能力,在多个基准测试中取得明显性能增益。
English: Current large vision-language models often neglect fine-grained visual details by focusing only on text, but the proposed Autoregressive Semantic Visual Reconstruction (ASVR) method enhances multimodal understanding by jointly learning visual and textual modalities, achieving significant performance improvements across benchmarks.
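The training objective reduces to adding a second autoregressive cross-entropy over discrete semantic visual tokens alongside the usual text loss. A minimal sketch (shapes, alignment, and the weight `lam` are assumptions):

```python
import torch.nn.functional as F

def asvr_loss(text_logits, text_targets, vis_logits, vis_sem_targets, lam=1.0):
    """All logits assumed already aligned to next-token targets.
    text_logits: (B, L, V); vis_logits: (B, N, C) over a semantic visual codebook;
    vis_sem_targets: discrete semantic tokens (not raw pixels, which the paper finds harmful)."""
    loss_text = F.cross_entropy(text_logits.flatten(0, 1), text_targets.flatten())
    loss_vis = F.cross_entropy(vis_logits.flatten(0, 1), vis_sem_targets.flatten())
    return loss_text + lam * loss_vis
```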

Authors:Haozhen Zhang, Tao Feng, Jiaxuan You
Title: Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
Abstract:
The rapid emergence of diverse large language models (LLMs) has spurred the development of LLM routers that assign user queries to the most suitable model. However, existing LLM routers typically perform a single-round, one-to-one mapping (i.e., assigning each query to a single model in isolation), which limits their capability to tackle complex tasks that demand the complementary strengths of multiple LLMs. In this paper, we present Router-R1, a reinforcement learning (RL)-based framework that formulates multi-LLM routing and aggregation as a sequential decision process. Router-R1 instantiates the router itself as a capable LLM, leveraging its reasoning ability to interleave "think" actions (internal deliberation) with "route" actions (dynamic model invocation), and integrates each response into its evolving context. To facilitate learning, we employ a lightweight rule-based reward comprising format rewards, final outcome rewards, and a novel cost reward for optimizing the balance between performance and cost, opening a pathway toward enhancing performance-cost trade-offs via RL. Router-R1 also conditions only on simple model descriptors such as pricing, latency, and example performance, enabling strong generalization to unseen model selection. Experiments on seven general and multi-hop QA benchmarks show that Router-R1 outperforms several strong baselines, achieving superior performance while maintaining robust generalization and cost management.
中文: 本文提出Router-R1强化学习框架,通过将多模型路由构建为序列决策过程,动态调用并整合不同大语言模型的优势,在多个基准测试中实现了性能与成本的最优平衡。
English: This paper introduces Router-R1, a reinforcement learning framework that enhances multi-LLM routing by dynamically selecting and aggregating models through sequential decision-making, achieving superior performance and cost efficiency across diverse benchmarks.
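The rule-based reward is light enough to sketch directly. The three components below follow the abstract (format, final outcome, cost); the exact checks and the weight `w_cost` are assumptions:

```python
def router_reward(trace: str, answer: str, gold: str, cost: float, w_cost=0.1):
    """Format: balanced think tags; outcome: exact match; cost: penalize spend."""
    r_format = 1.0 if trace.count("<think>") == trace.count("</think>") else 0.0
    r_outcome = 1.0 if answer.strip() == gold.strip() else 0.0
    return r_format + r_outcome - w_cost * cost
```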

Authors:Lei Zhang, Jiaxi Yang, Min Yang, Jian Yang, Mouxiang Chen, Jiajun Zhang, Zeyu Cui, Binyuan Hui, Junyang Lin
Title: SWE-Flow: Synthesizing Software Engineering Data in a Test-Driven Manner
Abstract:
We introduce **SWE-Flow**, a novel data synthesis framework grounded in Test-Driven Development (TDD). Unlike existing software engineering data that rely on human-submitted issues, **SWE-Flow** automatically infers incremental development steps directly from unit tests, which inherently encapsulate high-level requirements. The core of **SWE-Flow** is the construction of a Runtime Dependency Graph (RDG), which precisely captures function interactions, enabling the generation of a structured, step-by-step *development schedule*. At each step, **SWE-Flow** produces a partial codebase, the corresponding unit tests, and the necessary code modifications, resulting in fully verifiable TDD tasks. With this approach, we generated 16,061 training instances and 2,020 test instances from real-world GitHub projects, creating the **SWE-Flow-Eval** benchmark. Our experiments show that fine-tuning open models on this dataset significantly improves performance in TDD-based coding. To facilitate further research, we release all code, datasets, models, and Docker images at [Github](https://github.com/Hambaobao/SWE-Flow).
中文摘要:SWE-Flow是一种基于测试驱动开发的新型数据合成框架,通过单元测试自动生成可验证的开发步骤,创建的SWE-Flow-Eval基准显著提升了AI模型在代码生成任务中的表现。
English Summary: SWE-Flow is a novel TDD-based data synthesis framework that automatically generates verifiable development steps from unit tests, creating the SWE-Flow-Eval benchmark which significantly improves AI coding performance when used for fine-tuning.
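Given a Runtime Dependency Graph, deriving the development schedule is a topological sort: a function's callees must exist before the unit tests exercising it can pass. A toy sketch with the standard library (building the RDG from traced executions is the hard part and is omitted; the example graph is hypothetical):

```python
import graphlib  # stdlib topological sorting (Python 3.9+)

def development_schedule(rdg: dict[str, set[str]]):
    """rdg maps each function to the functions it calls at runtime."""
    return list(graphlib.TopologicalSorter(rdg).static_order())

# test_api calls handler, which calls parse:
steps = development_schedule({"test_api": {"handler"}, "handler": {"parse"}, "parse": set()})
# -> ['parse', 'handler', 'test_api']; each step yields a partial codebase plus the
#    unit tests that the next increment of code must make pass.
```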

Authors:Theo Zhang, Madurya Suresh, Anne S. Warlaumont, Kasia Hitczenko, Alejandrina Cristia, Margaret Cychosz
Title: Employing self-supervised learning models for cross-linguistic child speech maturity classification
Abstract:
Speech technology systems struggle with many downstream tasks for child speech due to small training corpora and the difficulties that child speech poses. We apply a novel dataset, SpeechMaturity, to state-of-the-art transformer models to address a fundamental classification task: identifying child vocalizations. Unlike previous corpora, our dataset captures maximally ecologically-valid child vocalizations across an unprecedented sample, comprising children acquiring 25+ languages in the U.S., Bolivia, Vanuatu, Papua New Guinea, Solomon Islands, and France. The dataset contains 242,004 labeled vocalizations, orders of magnitude larger than previous work. Models were trained to distinguish between cry, laughter, mature speech (consonant+vowel), and immature speech (just consonant or vowel). Models trained on the dataset outperform state-of-the-art models trained on previous datasets, achieve classification accuracy comparable to humans, and are robust across rural and urban settings.
中文: 研究人员开发了一个名为SpeechMaturity的大型生态有效数据集,用于训练Transformer模型精确分类儿童发声,实现了与人类相当的分类准确率,并在不同环境中表现出强鲁棒性。
English: Researchers developed a large, ecologically-valid dataset called SpeechMaturity to train transformer models that accurately classify child vocalizations, achieving human-comparable accuracy and robustness across diverse settings.

Authors:Qinggang Zhang, Zhishang Xiang, Yilin Xiao, Le Wang, Junhui Li, Xinrun Wang, Jinsong Su
Title: FaithfulRAG: Fact-Level Conflict Modeling for Context-Faithful Retrieval-Augmented Generation
Abstract:
Large language models (LLMs) augmented with retrieval systems have demonstrated significant potential in handling knowledge-intensive tasks. However, these models often struggle with unfaithfulness issues, generating outputs that either ignore the retrieved context or inconsistently blend it with the LLM's parametric knowledge. This issue is particularly severe in cases of knowledge conflict, where the retrieved context conflicts with the model's parametric knowledge. While existing faithful RAG approaches enforce strict context adherence through well-designed prompts or modified decoding strategies, our analysis reveals a critical limitation: they achieve faithfulness by forcibly suppressing the model's parametric knowledge, which undermines the model's internal knowledge structure and increases the risk of misinterpreting the context. To this end, this paper proposes FaithfulRAG, a novel framework that resolves knowledge conflicts by explicitly modeling discrepancies between the model's parametric knowledge and retrieved context. Specifically, FaithfulRAG identifies conflicting knowledge at the fact level and designs a self-thinking process, allowing LLMs to reason about and integrate conflicting facts before generating responses. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code is available at https://github.com/DeepLearnXMU/Faithful-RAG.
中文: 本文提出FaithfulRAG创新框架,通过显式建模参数化知识与检索上下文间的差异,在生成响应前进行自我推理以整合冲突事实,从而解决检索增强大语言模型中的知识冲突问题。
English: This paper introduces FaithfulRAG, a novel framework that resolves knowledge conflicts in retrieval-augmented LLMs by explicitly modeling discrepancies between parametric knowledge and retrieved context, allowing self-reasoning to integrate conflicting facts before response generation.

Authors:Andrew Shin
Title: Can A Gamer Train A Mathematical Reasoning Model?
Abstract:
While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model, by integrating reinforcement learning and memory optimization techniques. Specifically, we train a 1.5B parameter mathematical reasoning model on an RTX 3080 Ti with 16GB of memory that, in resource-constrained environments, achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. https://github.com/shinandrew/YouronMath.
Chinese: 该研究通过强化学习和内存优化技术,在单个游戏显卡上成功训练出性能优异的15亿参数数学推理模型,打破了高性能AI研究必须依赖庞大计算资源的传统模式。
English: This research demonstrates that a single gaming GPU can train a competitive 1.5B-parameter mathematical reasoning model using reinforcement learning and memory optimization, challenging the need for massive computational infrastructure.

Authors:Yuni Susanti, Michael Färber
Title: Paths to Causality: Finding Informative Subgraphs Within Knowledge Graphs for Knowledge-Based Causal Discovery
Abstract:
Inferring causal relationships between variable pairs is crucial for understanding multivariate interactions in complex systems. Knowledge-based causal discovery -- which involves inferring causal relationships by reasoning over the metadata of variables (e.g., names or textual context) -- offers a compelling alternative to traditional methods that rely on observational data. However, existing methods using Large Language Models (LLMs) often produce unstable and inconsistent results, compromising their reliability for causal inference. To address this, we introduce a novel approach that integrates Knowledge Graphs (KGs) with LLMs to enhance knowledge-based causal discovery. Our approach identifies informative metapath-based subgraphs within KGs and further refines the selection of these subgraphs using Learning-to-Rank-based models. The top-ranked subgraphs are then incorporated into zero-shot prompts, improving the effectiveness of LLMs in inferring the causal relationship. Extensive experiments on biomedical and open-domain datasets demonstrate that our method outperforms most baselines by up to 44.4 points in F1 scores, evaluated across diverse LLMs and KGs. Our code and datasets are available on GitHub: https://github.com/susantiyuni/path-to-causality
中文摘要:本研究提出了一种将知识图谱与大型语言模型相结合的新方法,通过元路径子图排序和零样本提示增强基于知识的因果发现,在多个数据集上相比基线方法F1分数最高提升44.4分。
English Summary: This study introduces a novel approach that integrates Knowledge Graphs with Large Language Models to enhance knowledge-based causal discovery, achieving up to 44.4-point F1 score improvements over baselines through metapath subgraph ranking and zero-shot prompting.
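The last stage, injecting the top-ranked subgraphs into a zero-shot prompt, looks roughly like the sketch below (prompt wording and the path encoding are assumptions):

```python
def build_causal_prompt(pair, ranked_paths, top_k=3):
    """ranked_paths: metapath subgraphs already ordered by the learning-to-rank model."""
    context = "\n".join(" -> ".join(path) for path in ranked_paths[:top_k])
    a, b = pair
    return (f"Knowledge graph paths:\n{context}\n\n"
            f"Question: Does '{a}' causally affect '{b}'? Answer yes or no.")

print(build_causal_prompt(("smoking", "lung cancer"),
                          [["smoking", "tar exposure", "lung cancer"]]))
```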

Authors:Kongcheng Zhang, Qi Yao, Shunyu Liu, Yingjie Wang, Baisheng Lai, Jieping Ye, Mingli Song, Dacheng Tao
Title: Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning
Abstract:
Recent advances of Reinforcement Learning (RL) have highlighted its potential in complex reasoning tasks, yet effective training often relies on external supervision, which limits the broader applicability. In this work, we propose a novel self-rewarding reinforcement learning framework to enhance Large Language Model (LLM) reasoning by leveraging the consistency of intermediate reasoning states across different reasoning trajectories. Our key insight is that correct responses often exhibit consistent trajectory patterns in terms of model likelihood: their intermediate reasoning states tend to converge toward their own final answers (high consistency) with minimal deviation toward other candidates (low volatility). Inspired by this observation, we introduce CoVo, an intrinsic reward mechanism that integrates Consistency and Volatility via a robust vector-space aggregation strategy, complemented by a curiosity bonus to promote diverse exploration. CoVo enables LLMs to perform RL in a self-rewarding manner, offering a scalable pathway for learning to reason without external supervision. Extensive experiments on diverse reasoning benchmarks show that CoVo achieves performance comparable to or even surpassing supervised RL. Our code is available at https://github.com/sastpg/CoVo.
中文摘要:提出的CoVo框架通过利用推理轨迹的一致性,使大型语言模型能够进行自我奖励的强化学习,无需外部监督即可达到与监督方法相当的性能。
English Summary: The proposed CoVo framework enables self-rewarding reinforcement learning for large language models by leveraging consistency in reasoning trajectories, achieving performance comparable to supervised methods without external supervision.
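The intrinsic signal can be sketched from a matrix of similarities between intermediate reasoning states and candidate answers. The column convention and the plain difference below are simplifications; the paper aggregates in a vector space and adds a curiosity bonus:

```python
import numpy as np

def covo_reward(state_sims):
    """state_sims: (T, C), similarity of T intermediate states to C >= 2 candidate answers;
    column 0 is assumed to be the trajectory's own final answer."""
    consistency = state_sims[:, 0].mean()               # drift toward own final answer
    volatility = state_sims[:, 1:].max(axis=1).mean()   # pull toward competing answers
    return consistency - volatility                     # high for stable, convergent traces
```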

Authors:Mingyu Zheng, Zhifan Feng, Jia Wang, Lanrui Wang, Zheng Lin, Yang Hao, Weiping Wang
Title: TableDreamer: Progressive and Weakness-guided Data Synthesis from Scratch for Table Instruction Tuning
Abstract:
Despite the commendable progress of recent LLM-based data synthesis methods, they face two limitations in generating table instruction tuning data. First, they cannot thoroughly explore the vast input space of table understanding tasks, leading to limited data diversity. Second, they ignore the weaknesses in table understanding ability of the target LLM and blindly pursue the increase of data quantity, resulting in suboptimal data efficiency. In this paper, we introduce a progressive and weakness-guided data synthesis framework tailored for table instruction tuning, named TableDreamer, to mitigate the above issues. Specifically, we first synthesize diverse tables and related instructions as seed data, and then perform an iterative exploration of the input space under the guidance of the newly identified weakness data, which eventually serve as the final training data for fine-tuning the target LLM. Extensive experiments on 10 tabular benchmarks demonstrate the effectiveness of the proposed framework, which boosts the average accuracy of Llama3.1-8B-instruct by 11.62% (49.07% to 60.69%) with 27K GPT-4o synthetic data and outperforms state-of-the-art data synthesis baselines which use more training data. The code and data are available at https://github.com/SpursGoZmy/TableDreamer.
中文: TableDreamer提出了一种渐进式、弱点引导的数据合成框架,通过生成多样化数据并迭代修正大语言模型在表格理解中的不足,以少量合成数据显著提升了模型性能。
English: TableDreamer introduces a progressive, weakness-guided framework to enhance table instruction tuning by synthesizing diverse data and iteratively addressing LLM weaknesses, significantly improving accuracy with minimal synthetic data.
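The framework boils down to a weakness-guided loop. In the schematic below, `synthesize`, `evaluate`, and `finetune` are hypothetical callables standing in for the GPT-4o-based pipeline:

```python
def tabledreamer(seed_data, synthesize, evaluate, finetune, rounds=3):
    """Progressive synthesis: expand the input space around the target LLM's failures."""
    train = list(seed_data)
    for _ in range(rounds):
        weaknesses = [ex for ex in evaluate(train) if not ex["correct"]]
        train += synthesize(weaknesses)   # new tables/instructions near failure modes
    return finetune(train)                # final fine-tuning on the accumulated data
```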

Authors:Liyan Xu, Zhenlin Su, Mo Yu, Jiangnan Li, Fandong Meng, Jie Zhou
Title: Dense Retrievers Can Fail on Simple Queries: Revealing The Granularity Dilemma of Embeddings
Abstract:
This work stems from an observed limitation of text encoders: embeddings may not be able to recognize fine-grained entities or events within encoded semantics, resulting in failed retrieval even in simple cases. To examine such behaviors, we first introduce a new evaluation dataset, CapRetrieval, in which passages are image captions and queries are phrases targeting entity or event concepts in diverse forms. Zero-shot evaluation suggests that encoders often struggle with this fine-grained matching, regardless of training sources or model size. Aiming for enhancement, we proceed to finetune encoders with our proposed data generation strategies, enabling a small 0.1B encoder to outperform the state-of-the-art 7B model. Within this process, we further uncover the granularity dilemma, a challenge for embeddings to capture fine-grained salience while aligning with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
中文: 本研究针对文本编码器在识别细粒度实体和事件方面的局限性,通过引入CapRetrieval数据集并提出数据生成策略,使小型编码器性能超越大型模型,同时揭示了嵌入语义中的粒度困境。
English: This study addresses the limitation of text encoders in recognizing fine-grained entities and events by introducing the CapRetrieval dataset and proposing data generation strategies that enable a small encoder to outperform larger models, while also identifying the granularity dilemma in embedding semantics.

Authors:Xiao Wei, Xiaobao Wang, Ning Zhuang, Chenyang Wang, Longbiao Wang, Jianwu Dang
Title: Integration of Old and New Knowledge for Generalized Intent Discovery: A Consistency-driven Prototype-Prompting Framework
Abstract:
Intent detection aims to identify user intents from natural language inputs, where supervised methods rely heavily on labeled in-domain (IND) data and struggle with out-of-domain (OOD) intents, limiting their practical applicability. Generalized Intent Discovery (GID) addresses this by leveraging unlabeled OOD data to discover new intents without additional annotation. However, existing methods focus solely on clustering unsupervised data while neglecting domain adaptation. Therefore, we propose a consistency-driven prototype-prompting framework for GID from the perspective of integrating old and new knowledge, which includes a prototype-prompting framework for transferring old knowledge from external sources, and a hierarchical consistency constraint for learning new knowledge from target domains. We conducted extensive experiments and the results show that our method significantly outperforms all baseline methods, achieving state-of-the-art results, which strongly demonstrates the effectiveness and generalization of our methods. Our source code is publicly available at https://github.com/smileix/cpp.
Chinese: 针对广义意图发现提出的基于一致性的原型提示框架,通过原型提示传递外部知识和分层一致性约束学习目标领域知识,实现了无需额外标注的新意图发现,并取得了最先进的性能表现。
English: The proposed consistency-driven prototype-prompting framework for Generalized Intent Discovery effectively integrates old and new knowledge through prototype prompting and hierarchical consistency constraints, achieving state-of-the-art performance in discovering new intents without additional annotation.

Authors:Jiaxiang Liu, Boxuan Xing, Chenhao Yuan, Chenxiang Zhang, Di Wu, Xiusheng Huang, Haida Yu, Chuhan Lang, Pengfei Cao, Jun Zhao, Kang Liu
Title: Know-MRI: A Knowledge Mechanisms Revealer&Interpreter for Large Language Models
Abstract:
As large language models (LLMs) continue to advance, there is a growing urgency to enhance the interpretability of their internal knowledge mechanisms. Consequently, many interpretation methods have emerged, aiming to unravel the knowledge mechanisms of LLMs from various perspectives. However, current interpretation methods differ in input data formats and interpretation outputs. The tools integrating these methods are only capable of supporting tasks with specific inputs, significantly constraining their practical applications. To address these challenges, we present an open-source Knowledge Mechanisms Revealer&Interpreter (Know-MRI) designed to analyze the knowledge mechanisms within LLMs systematically. Specifically, we have developed an extensible core module that can automatically match different input data with interpretation methods and consolidate the interpretation outputs. It enables users to freely choose appropriate interpretation methods based on the inputs, making it easier to comprehensively diagnose the model's internal knowledge mechanisms from multiple perspectives. Our code is available at https://github.com/nlpkeg/Know-MRI. We also provide a demonstration video on https://youtu.be/NVWZABJ43Bs.
Chinese: 针对当前大语言模型解释方法存在的局限性,我们开发了开源工具Know-MRI,其可扩展核心能自动匹配输入与解释方法并整合输出结果,从而支持从多角度全面诊断模型的内部知识机制。
English: To address the limitations of current interpretation methods for large language models, we introduce Know-MRI, an open-source tool with an extensible core that automatically matches inputs with interpretation methods and integrates outputs for comprehensive analysis of internal knowledge mechanisms.

Authors:Weiya Li, Junjie Chen, Bei Li, Boyang Liu, Zichen Wen, Nuanqiao Shan, Xiaoqian Liu, Anping Liu, Huajie Liu, Hu Song, Linfeng Zhang
Title: TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration
Abstract:
Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for Translation Agents with Cognitive-Theoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at https://github.com/weiyali126/TACTIC.
中文: TACTIC框架提出了一种基于认知理论的多智能体翻译系统,通过模拟人类译者的策略分工协作,在多项语言基准测试中实现了最优性能表现。
English: The TACTIC framework introduces a cognitively inspired multi-agent system that simulates human translation strategies, achieving state-of-the-art performance by integrating specialized agents for drafting, refinement, and evaluation across diverse language benchmarks.

Authors:Edoardo Cetin, Tianyu Zhao, Yujin Tang
Title: Reinforcement Learning Teachers of Test Time Scaling
Abstract:
Training reasoning language models (LMs) with reinforcement learning (RL) for one-hot correctness inherently relies on the LM being able to explore and solve its task with some chance at initialization. Furthermore, a key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations rather than being deployed themselves. From these considerations, we introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs) focused on yielding the most effective downstream distillation. RLTs are prompted with both the question and solution to each problem, and tasked to simply "connect-the-dots" with detailed explanations tailored for their students. We train RLTs with dense rewards obtained by feeding each explanation to the student and testing its understanding of the problem's solution. In practice, the raw outputs of a 7B RLT provide higher final performance on competition and graduate-level tasks than existing distillation and cold-starting pipelines that collect and postprocess the reasoning traces of orders of magnitude larger LMs. Furthermore, RLTs maintain their effectiveness when training larger students and when applied zero-shot to out-of-distribution tasks, unlocking new levels of efficiency and re-usability for the RL reasoning framework.
中文摘要:本文提出强化学习教师(RLT)框架,通过训练模型生成针对下游学生的详细解释来规避强化学习的探索难题,相比传统方法能以更小模型实现更优性能,并保持跨任务的有效性。
English Summary: The paper introduces Reinforcement-Learned Teachers (RLTs), a framework that trains reasoning models to generate detailed explanations for distillation, overcoming exploration challenges in RL while achieving superior performance with smaller models compared to traditional methods.
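The dense reward is the student's comprehension of the ground-truth solution given the teacher's explanation. A minimal sketch of the log-likelihood part, assuming a Hugging-Face-style causal LM (the paper's reward has additional terms):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rlt_reward(student, tokenizer, question, explanation, solution):
    """Score = mean student log-prob of the solution tokens, given the explanation."""
    prompt = tokenizer(question + explanation, return_tensors="pt").input_ids
    target = tokenizer(solution, return_tensors="pt").input_ids
    ids = torch.cat([prompt, target], dim=1)
    logits = student(ids).logits[:, prompt.shape[1] - 1:-1]   # positions predicting solution
    logp = F.log_softmax(logits, dim=-1).gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return logp.mean().item()
```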

Authors:Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee
Title: Draft-based Approximate Inference for LLMs
Abstract:
Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a novel framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Specifically, we introduce two instantiations of our proposed framework: (i) SpecKV, the first method that leverages a draft output to accurately assess the importance of each KV pair for more effective KV cache dropping, and (ii) SpecPC, which uses the draft model's attention activations to identify and discard unimportant prompt tokens. We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput. Our code is available at https://github.com/furiosa-ai/draft-based-approx-llm.
Chinese Summary: 本文提出了一种新颖的框架,利用小型草稿模型精确预测令牌和KV对的重要性,从而提升长上下文大语言模型推理效率,在保持内存使用、延迟和吞吐量改进的同时,相比现有方法实现了更高的准确性。
English Summary: This paper introduces a novel framework that uses small draft models to enhance the efficiency of long-context LLM inference by accurately predicting token and KV pair importance, leading to improved accuracy and performance in memory usage, latency, and throughput compared to existing methods.
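SpecKV's key primitive is scoring each KV pair by the attention the draft model's lookahead tokens pay to it, then keeping only the top fraction in the target's cache. A sketch, with the keep ratio as an assumption:

```python
import torch

def speckv_keep_mask(draft_attn, keep_ratio=0.25):
    """draft_attn: (heads, q_len, k_len) attention from draft-generated tokens to the prompt.
    Returns a bool mask over the k_len positions to retain in the target model's KV cache."""
    importance = draft_attn.mean(dim=(0, 1))       # mass each KV pair receives on average
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.zeros_like(importance, dtype=torch.bool)
    keep[importance.topk(k).indices] = True
    return keep
```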

Authors:Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, Bo Han
Title: From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?
Abstract:
While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast, active reasoning, where an LLM must interact with external systems to acquire missing evidence or data, has received little systematic attention. To address this shortfall, we present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families (detective cases, situation puzzles, and guessing numbers) that together simulate real-world, agentic scenarios and measure performance across commonsense, logical, and symbolic reasoning challenges. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning: they frequently fail to acquire or leverage the information needed to solve tasks. This gap highlights a stark divergence between their passive and active reasoning abilities. Moreover, ablation studies indicate that even advanced strategies, such as tree-based searching or post-training approaches, yield only modest gains and fall short of the levels required for real-world deployment. Collectively, these findings highlight the critical need to advance methodology for active reasoning, e.g., incorporating interactive learning, real-time feedback loops, and environment-aware objectives for training. The benchmark is publicly available at: https://github.com/tmlr-group/AR-Bench.
中文摘要:AR-Bench是一个专为评估大语言模型主动推理能力设计的新基准,揭示了模型在获取和利用外部信息方面相比被动推理存在显著困难,并强调了改进交互学习等方法的必要性。
English Summary: AR-Bench is a new benchmark designed to evaluate large language models' active reasoning skills, revealing their significant struggles in acquiring and using external information compared to passive reasoning, and highlighting the need for improved methodologies like interactive learning.

Authors:Xie Yi, Zhanke Zhou, Chentao Cao, Qiyu Niu, Tongliang Liu, Bo Han
Title: From Debate to Equilibrium: Belief-Driven Multi-Agent LLM Reasoning via Bayesian Nash Equilibrium
Abstract:
Multi-agent frameworks can substantially boost the reasoning power of large language models (LLMs), but they typically incur heavy computational costs and lack convergence guarantees. To overcome these challenges, we recast multi-LLM coordination as an incomplete-information game and seek a Bayesian Nash equilibrium (BNE), in which each agent optimally responds to its probabilistic beliefs about the strategies of others. We introduce Efficient Coordination via Nash Equilibrium (ECON), a hierarchical reinforcement-learning paradigm that marries distributed reasoning with centralized final output. Under ECON, each LLM independently selects responses that maximize its expected reward, conditioned on its beliefs about co-agents, without requiring costly inter-agent exchanges. We mathematically prove that ECON attains a markedly tighter regret bound than non-equilibrium multi-agent schemes. Empirically, ECON outperforms existing multi-LLM approaches by 11.2% on average across six benchmarks spanning complex reasoning and planning tasks. Further experiments demonstrate ECON's ability to flexibly incorporate additional models, confirming its scalability and paving the way toward larger, more powerful multi-LLM ensembles. The code is publicly available at: https://github.com/tmlr-group/ECON.
Chinese: ECON通过将多智能体协调建模为不完全信息博弈,提出了一种结合分布式推理与集中输出的分层强化学习范式,在显著降低通信成本的同时实现了更强的性能与理论保障。
English: ECON introduces a hierarchical reinforcement-learning framework that models multi-LLM coordination as an incomplete-information game, achieving superior performance with a tighter regret bound and eliminating costly inter-agent communication.

Authors:Subba Reddy Oota, Khushbu Pahwa, Prachi Jindal, Satya Sai Srinath Namburi, Maneesh Singh, Tanmoy Chakraborty, Bapi S. Raju, Manish Gupta
Title: Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in the Brain
Abstract:
Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) exhibit a higher degree of brain alignment compared to unimodal models in both unimodal and multimodal stimulus settings. More recently, instruction-tuned multimodal models have been shown to generate task-specific representations that align strongly with brain activity. However, prior work evaluating the brain alignment of MLLMs has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigated brain alignment, that is, measuring the degree of predictivity of neural activity recorded while participants were watching naturalistic movies (video along with audio) with representations derived from MLLMs. We utilized instruction-specific embeddings from six video and two audio instruction-tuned MLLMs. Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal (by 15%) and unimodal models (by 20%). Our evaluation of MLLMs for both video and audio tasks using language-guided instructions shows clear disentanglement in task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain, with early sensory areas showing strong alignment with early layers, while higher-level visual and language regions align more with middle to late layers. These findings provide clear evidence for the role of task-specific instructions in improving the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both systems. We make the code publicly available at https://github.com/subbareddy248/mllm_videos.
中文: 指令调优的多模态大语言模型在自然观影场景中显著优于非指令调优模型,其任务特定表征能精确对应大脑层级处理机制,为大脑与计算系统的映射研究开辟了新途径。
English: Instruction-tuned multimodal large language models significantly outperform non-instruction-tuned models in aligning with brain activity during naturalistic movie viewing, showing hierarchical layer correspondence and task-specific representation disentanglement that advances brain-computation mapping.
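The alignment methodology is standard voxel-wise encoding: regress fMRI responses on model features and score held-out correlation per voxel. A simplified scikit-learn sketch (the paper's cross-validation, delays, and feature extraction are more involved):

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def voxelwise_alignment(features, voxels):
    """features: (n_timepoints, d) MLLM embeddings; voxels: (n_timepoints, n_voxels) fMRI."""
    Xtr, Xte, Ytr, Yte = train_test_split(features, voxels, test_size=0.2, random_state=0)
    model = RidgeCV(alphas=[1.0, 10.0, 100.0]).fit(Xtr, Ytr)
    pred = model.predict(Xte)
    rs = [np.corrcoef(pred[:, v], Yte[:, v])[0, 1] for v in range(Yte.shape[1])]
    return float(np.mean(rs))   # mean held-out Pearson r = brain alignment score
```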

Authors:Yu-Ang Lee, Guan-Ting Yi, Mei-Yi Liu, Jui-Chao Lu, Guan-Bo Yang, Yun-Nung Chen
Title: Compound AI Systems Optimization: A Survey of Methods, Challenges, and Future Directions
Abstract:
Recent advancements in large language models (LLMs) and AI systems have led to a paradigm shift in the design and optimization of complex AI workflows. By integrating multiple components, compound AI systems have become increasingly adept at performing sophisticated tasks. However, as these systems grow in complexity, new challenges arise in optimizing not only individual components but also their interactions. While traditional optimization methods such as supervised fine-tuning (SFT) and reinforcement learning (RL) remain foundational, the rise of natural language feedback introduces promising new approaches, especially for optimizing non-differentiable systems. This paper provides a systematic review of recent progress in optimizing compound AI systems, encompassing both numerical and language-based techniques. We formalize the notion of compound AI system optimization, classify existing methods along several key dimensions, and highlight open research challenges and future directions in this rapidly evolving field. A list of surveyed papers is publicly available at https://github.com/MiuLab/AISysOpt-Survey.
中文:大型语言模型和人工智能系统的进展推动了复合AI系统的发展,这些系统通过整合多个组件处理复杂任务,但其日益增长的复杂性带来了优化单个组件及其交互的挑战,因此本文系统回顾了包括传统技术和新兴自然语言反馈方法在内的优化策略。
English: Recent progress in large language models and AI systems has advanced compound AI systems, which integrate multiple components to handle complex tasks, yet their growing complexity poses challenges in optimizing both individual elements and their interactions, prompting a systematic review of optimization methods including traditional techniques and emerging natural language feedback approaches.

Authors:Chupei Wang, Jiaqiu Vince Sun
Title: Unable to Forget: Proactive Interference Reveals Working Memory Limits in LLMs Beyond Context Length
Abstract:
Information retrieval in Large Language Models (LLMs) is increasingly recognized as intertwined with generation capabilities rather than mere lookup. While longer contexts are often assumed to improve retrieval, the effects of intra-context interference remain understudied. To address this, we adapt the proactive interference (PI) paradigm from cognitive science, where earlier information disrupts recall of newer updates. In humans, susceptibility to such interference is inversely linked to working memory capacity. We introduce PI-LLM, an evaluation that sequentially streams semantically related key-value updates and queries only the final values. Although these final values are clearly positioned just before the query, LLM retrieval accuracy declines log-linearly toward zero as interference accumulates; errors arise from retrieving previously overwritten values. Attempts to mitigate interference via prompt engineering (e.g., instructing models to ignore earlier input) yield limited success. These findings reveal a fundamental constraint on LLMs' ability to disentangle interference and flexibly manipulate information, suggesting a working memory bottleneck beyond mere context access. This calls for approaches that strengthen models' ability to suppress irrelevant content during retrieval.
中文总结:该研究发现大型语言模型存在前摄干扰问题,即先前的信息会干扰对后续更新的回忆,即使目标数据位置明确,随着干扰累积,检索准确率仍会大幅下降。
English Summary: The study reveals that large language models suffer from proactive interference, where earlier information disrupts recall of recent updates, causing retrieval accuracy to decline significantly as interference accumulates despite clear positioning of target data.
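The probe itself is easy to reproduce: stream semantically related key-value overwrites, then ask only for the final values. A minimal generator sketch:

```python
import random

def make_pi_stream(keys, n_updates, seed=0):
    """Returns the interference stream, the query, and the ground-truth final values."""
    rng = random.Random(seed)
    stream, final = [], {}
    for _ in range(n_updates):
        k = rng.choice(keys)
        v = rng.randint(0, 999)
        stream.append(f"{k} = {v}")
        final[k] = v                     # earlier values become proactive interference
    query = "Report the CURRENT value of every key, ignoring all earlier updates."
    return "\n".join(stream), query, final
```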

Authors:Lijing Zhu, Qizhen Lan, Qing Tian, Wenbo Sun, Li Yang, Lu Xia, Yixin Xie, Xi Xiao, Tiehang Duan, Cui Tao, Shuteng Niu
Title: ETT-CKGE: Efficient Task-driven Tokens for Continual Knowledge Graph Embedding
Abstract:
Continual Knowledge Graph Embedding (CKGE) seeks to integrate new knowledge while preserving past information. However, existing methods struggle with efficiency and scalability due to two key limitations: (1) suboptimal knowledge preservation between snapshots caused by manually designed node/relation importance scores that ignore graph dependencies relevant to the downstream task, and (2) computationally expensive graph traversal for node/relation importance calculation, leading to slow training and high memory overhead. To address these limitations, we introduce ETT-CKGE (Efficient, Task-driven, Tokens for Continual Knowledge Graph Embedding), a novel task-guided CKGE method that leverages efficient task-driven tokens for efficient and effective knowledge transfer between snapshots. Our method introduces a set of learnable tokens that directly capture task-relevant signals, eliminating the need for explicit node scoring or traversal. These tokens serve as consistent and reusable guidance across snapshots, enabling efficient token-masked embedding alignment between snapshots. Importantly, knowledge transfer is achieved through simple matrix operations, significantly reducing training time and memory usage. Extensive experiments across six benchmark datasets demonstrate that ETT-CKGE consistently achieves superior or competitive predictive performance, while substantially improving training efficiency and scalability compared to state-of-the-art CKGE methods. The code is available at: https://github.com/lijingzhu1/ETT-CKGE/tree/main
中文: ETT-CKGE通过引入可学习的任务驱动标记,利用简单矩阵运算实现快照间高效知识迁移,在显著提升训练效率和可扩展性的同时,取得了优于或媲美现有方法的预测性能。
English: ETT-CKGE introduces learnable task-driven tokens to enable efficient knowledge transfer between snapshots through simple matrix operations, achieving superior performance while significantly improving training efficiency and scalability compared to existing methods.
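The token-guided transfer can be sketched as a soft relevance mask computed from the learnable tokens, weighting an embedding-drift penalty between snapshots. The attention form and scaling below are assumptions consistent only with the abstract's description of "simple matrix operations":

```python
import torch

def token_alignment_loss(tokens, emb_old, emb_new):
    """tokens: (m, d) learnable task tokens; emb_old/emb_new: (n, d) entity embeddings."""
    scale = tokens.shape[1] ** 0.5
    attn = torch.softmax(emb_old @ tokens.T / scale, dim=0)  # each token attends over entities
    relevance = attn.max(dim=1).values                       # task relevance per entity
    drift = (emb_new - emb_old).pow(2).sum(dim=-1)           # how far each embedding moved
    return (relevance * drift).mean()   # preserve old knowledge where the task cares
```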

Authors:Abdellah Ghassel, Ian Robinson, Gabriel Tanase, Hal Cooper, Bryan Thompson, Zhen Han, Vassilis N. Ioannidis, Soji Adeshina, Huzefa Rangwala
Title: Hierarchical Lexical Graph for Enhanced Multi-Hop Retrieval
Abstract:
Retrieval-Augmented Generation (RAG) grounds large language models in external evidence, yet it still falters when answers must be pieced together across semantically distant documents. We close this gap with the Hierarchical Lexical Graph (HLG), a three-tier index that (i) traces every atomic proposition to its source, (ii) clusters propositions into latent topics, and (iii) links entities and relations to expose cross-document paths. On top of HLG we build two complementary, plug-and-play retrievers: StatementGraphRAG, which performs fine-grained entity-aware beam search over propositions for high-precision factoid questions, and TopicGraphRAG, which selects coarse topics before expanding along entity links to supply broad yet relevant context for exploratory queries. Additionally, existing benchmarks lack the complexity required to rigorously evaluate multi-hop summarization systems, often focusing on single-document queries or limited datasets. To address this, we introduce a synthetic dataset generation pipeline that curates realistic, multi-document question-answer pairs, enabling robust evaluation of multi-hop retrieval systems. Extensive experiments across five datasets demonstrate that our methods outperform naive chunk-based RAG, achieving an average relative improvement of 23.1% in retrieval recall and correctness. An open-source Python library is available at https://github.com/awslabs/graphrag-toolkit.
中文: 分层词汇图(HLG)通过构建三层索引和两种互补检索器,解决了RAG在跨语义分散文档整合信息时的不足,在多个数据集上平均将检索性能提升了23.1%。
English: The Hierarchical Lexical Graph (HLG) addresses RAG's limitations in connecting information across distant documents by creating a three-tier index and two complementary retrievers, significantly improving retrieval performance by 23.1% on average across multiple datasets.

Authors:Yunfei Xie, Yinsong Ma, Shiyi Lan, Alan Yuille, Junfei Xiao, Chen Wei
Title: Play to Generalize: Learning to Reason Through Game Play
Abstract:
Developing generalizable reasoning capabilities in multimodal large language models (MLLMs) remains challenging. Motivated by cognitive science literature suggesting that gameplay promotes transferable cognitive skills, we propose a novel post-training paradigm, Visual Game Learning, or ViGaL, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games. Specifically, we show that post-training a 7B-parameter MLLM via reinforcement learning (RL) on simple arcade-like games, e.g., Snake, significantly enhances its downstream performance on multimodal math benchmarks like MathVista, and on multi-discipline questions like MMMU, without seeing any worked solutions, equations, or diagrams during RL, suggesting the capture of transferable reasoning skills. Remarkably, our model outperforms specialist models tuned on multimodal reasoning data in multimodal reasoning benchmarks, while preserving the base model's performance on general visual benchmarks, a challenge where specialist models often fall short. Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks that unlock generalizable multimodal reasoning abilities in MLLMs.

Authors:Jiayi Sheng, Luna Lyu, Jikai Jin, Tony Xia, Alex Gu, James Zou, Pan Lu
Title: Solving Inequality Proofs with Large Language Models
Abstract:
Inequality proving, crucial across diverse scientific and mathematical fields, tests advanced reasoning skills such as discovering tight bounds and strategic theorem application. This makes it a distinct, demanding frontier for large language models (LLMs), offering insights beyond general mathematical problem-solving. Progress in this area is hampered by existing datasets that are often scarce, synthetic, or rigidly formal. We address this by proposing an informal yet verifiable task formulation, recasting inequality proving into two automatically checkable subtasks: bound estimation and relation prediction. Building on this, we release IneqMath, an expert-curated dataset of Olympiad-level inequalities, including a test set and training corpus enriched with step-wise solutions and theorem annotations. We also develop a novel LLM-as-judge evaluation framework, combining a final-answer judge with four step-wise judges designed to detect common reasoning flaws. A systematic evaluation of 29 leading LLMs on IneqMath reveals a surprising reality: even top models like o1 achieve less than 10% overall accuracy under step-wise scrutiny; this is a drop of up to 65.5% from their accuracy considering only final answer equivalence. This discrepancy exposes fragile deductive chains and a critical gap for current LLMs between merely finding an answer and constructing a rigorous proof. Scaling model size and increasing test-time computation yield limited gains in overall proof correctness. Instead, our findings highlight promising research directions such as theorem-guided reasoning and self-refinement. Code and data are available at https://ineqmath.github.io/.
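Chinese: 本文将不等式证明重构为界估计与关系预测两个可自动验证的子任务,发布奥赛级数据集IneqMath及LLM评判框架,发现即使顶尖模型在逐步审查下整体准确率也不足10%,揭示了得出答案与构建严谨证明之间的巨大差距。
English: This paper recasts inequality proving into two automatically checkable subtasks, bound estimation and relation prediction, releases the Olympiad-level IneqMath dataset with an LLM-as-judge evaluation framework, and shows that even top models achieve under 10% overall accuracy under step-wise scrutiny, exposing the gap between finding answers and constructing rigorous proofs.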

Authors:Michael K. Chen, Xikun Zhang, Jiaxing Huang, Dacheng Tao
Title: Improving Large Language Models with Concept-Aware Fine-Tuning
Abstract:
Large language models (LLMs) have become the cornerstone of modern AI. However, the existing paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Take the phrase "ribonucleic acid" as an example: an LLM will first decompose it into tokens, i.e., artificial text fragments ("rib", "on", ...), then learn each token sequentially, rather than grasping the phrase as a unified, coherent semantic entity. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a novel multi-token training method that redefines how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, this method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements compared to conventional next-token finetuning methods across diverse tasks, including traditional applications like text summarization and domain-specific ones like de novo protein design. Multi-token prediction was previously only possible in the prohibitively expensive pretraining phase; CAFT, to our knowledge, is the first to bring the multi-token setting to the post-training phase, thus effectively democratizing its benefits for the broader community of practitioners and researchers. Finally, the unexpected effectiveness of our proposed method suggests wider implications for the machine learning research community. All code and data are available at https://github.com/michaelchen-lab/caft-llm
Chinese: 大语言模型受限于逐词预测,但新提出的概念感知微调方法通过多词学习增强了概念理解能力,在多项任务中表现优异。
English: Large language models are limited by next-token prediction, but the new Concept-Aware Fine-Tuning method enables multi-token learning for improved conceptual understanding across various tasks.
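
As a rough illustration of what multi-token training looks like mechanically, the PyTorch sketch below averages cross-entropy losses from k auxiliary heads that each predict the token i steps ahead. This is a generic multi-token objective under assumed shapes, not necessarily CAFT's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_token_loss(hidden, heads, input_ids):
    """hidden: (B, T, d) transformer states; heads: list of k linear heads."""
    total = 0.0
    for i, head in enumerate(heads, start=1):
        logits = head(hidden[:, :-i, :])   # predict token t+i from state t
        targets = input_ids[:, i:]         # labels shifted i steps ahead
        total += F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
    return total / len(heads)

# Toy usage with random tensors (vocab of 100, k=2 prediction heads).
B, T, d, V = 2, 16, 32, 100
hidden = torch.randn(B, T, d)
heads = [torch.nn.Linear(d, V) for _ in range(2)]
input_ids = torch.randint(0, V, (B, T))
print(multi_token_loss(hidden, heads, input_ids))
```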

Authors:Adam Breuer
Title: E-LDA: Toward Interpretable LDA Topic Models with Strong Guarantees in Logarithmic Parallel Time
Abstract:
In this paper, we provide the first practical algorithms with provable guarantees for the problem of inferring the topics assigned to each document in an LDA topic model. This is the primary inference problem for many applications of topic models in social science, data exploration, and causal inference settings. We obtain this result by showing a novel non-gradient-based, combinatorial approach to estimating topic models. This yields algorithms that converge to near-optimal posterior probability in logarithmic parallel computation time (adaptivity) -- exponentially faster than any known LDA algorithm. We also show that our approach can provide interpretability guarantees such that each learned topic is formally associated with a known keyword. Finally, we show that unlike alternatives, our approach can maintain the independence assumptions necessary to use the learned topic model for downstream causal inference methods that allow researchers to study topics as treatments. In terms of practical performance, our approach consistently returns solutions of higher semantic quality than solutions from state-of-the-art LDA algorithms, neural topic models, and LLM-based topic models across a diverse range of text datasets and evaluation parameters.
Chinese: 本文首次提出具有可证明保证的LDA主题推断实用算法,通过创新组合方法实现指数级加速,同时保持可解释性以支持下游因果分析。
English: This paper introduces the first practical algorithms with provable guarantees for LDA topic inference, using a novel combinatorial approach that achieves exponential speedup and maintains interpretability for downstream causal analysis.

Authors:Mengyang Qiu, Tran Minh Nguyen, Zihao Huang, Zelong Li, Yang Gu, Qingyu Gao, Siliang Liu, Jungyeul Park
Title: Multilingual Grammatical Error Annotation: Combining Language-Agnostic Framework with Language-Specific Flexibility
Abstract:
Grammatical Error Correction (GEC) relies on accurate error annotation and evaluation, yet existing frameworks, such as $\texttt{errant}$, face limitations when extended to typologically diverse languages. In this paper, we introduce a standardized, modular framework for multilingual grammatical error annotation. Our approach combines a language-agnostic foundation with structured language-specific extensions, enabling both consistency and flexibility across languages. We reimplement $\texttt{errant}$ using $\texttt{stanza}$ to support broader multilingual coverage, and demonstrate the framework's adaptability through applications to English, German, Czech, Korean, and Chinese, ranging from general-purpose annotation to more customized linguistic refinements. This work supports scalable and interpretable GEC annotation across languages and promotes more consistent evaluation in multilingual settings. The complete codebase and annotation tools can be accessed at https://github.com/open-writing-evaluation/jp_errant_bea.
中文: 本文提出了一种标准化的模块化多语言语法错误标注框架,结合语言无关的基础和特定语言扩展,在英语、德语、捷克语、韩语和汉语等多种语言中实现了标注的一致性与灵活性。
English: This paper introduces a standardized, modular framework for multilingual grammatical error annotation that combines language-agnostic foundations with language-specific extensions, enhancing consistency and flexibility across diverse languages including English, German, Czech, Korean, and Chinese.
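
A hedged sketch of what an errant-style, stanza-backed edit extractor might look like: source and corrected tokens are aligned with difflib, and each edit is typed by the universal POS of the corrected span. The alignment and typing rules here are simplifications of the framework's actual annotation logic.

```python
import difflib
import stanza  # pip install stanza; run stanza.download("en") once first

nlp = stanza.Pipeline("en", processors="tokenize,pos", verbose=False)

def extract_edits(src: str, cor: str):
    """Align tokens and label each non-equal span with a UPOS-based type."""
    src_toks = [w.text for s in nlp(src).sentences for w in s.words]
    cor_words = [w for s in nlp(cor).sentences for w in s.words]
    cor_toks = [w.text for w in cor_words]
    edits = []
    sm = difflib.SequenceMatcher(a=src_toks, b=cor_toks)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            continue
        pos = cor_words[j1].upos if j1 < len(cor_words) else "NONE"
        edits.append((op, src_toks[i1:i2], cor_toks[j1:j2], pos))
    return edits

print(extract_edits("He go to school", "He goes to school"))
# [('replace', ['go'], ['goes'], 'VERB')]
```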

Authors:Jiaming Li, Haoran Ye, Yukun Chen, Xinyue Li, Lei Zhang, Hamid Alinejad-Rokny, Jimmy Chih-Hsien Peng, Min Yang
Title: Training Superior Sparse Autoencoders for Instruct Models
Abstract:
As large language models (LLMs) grow in scale and capability, understanding their internal mechanisms becomes increasingly critical. Sparse autoencoders (SAEs) have emerged as a key tool in mechanistic interpretability, enabling the extraction of human-interpretable features from LLMs. However, existing SAE training methods are primarily designed for base models, resulting in reduced reconstruction quality and interpretability when applied to instruct models. To bridge this gap, we propose $\underline{\textbf{F}}$inetuning-$\underline{\textbf{a}}$ligned $\underline{\textbf{S}}$equential $\underline{\textbf{T}}$raining ($\textit{FAST}$), a novel training method specifically tailored for instruct models. $\textit{FAST}$ aligns the training process with the data distribution and activation patterns characteristic of instruct models, resulting in substantial improvements in both reconstruction and feature interpretability. On Qwen2.5-7B-Instruct, $\textit{FAST}$ achieves a mean squared error of 0.6468 in token reconstruction, significantly outperforming baseline methods with errors of 5.1985 and 1.5096. In feature interpretability, $\textit{FAST}$ yields a higher proportion of high-quality features: for Llama3.2-3B-Instruct, $21.1\%$ scored in the top range, compared to $7.0\%$ and $10.2\%$ for $\textit{BT(P)}$ and $\textit{BT(F)}$. Surprisingly, we discover that intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior. Code, data, and 240 trained SAEs are available at https://github.com/Geaming2002/FAST.
中文摘要:FAST方法通过将训练过程与指令模型的数据分布和激活模式对齐,显著提升了稀疏自编码器在指令模型上的重建质量和特征可解释性,同时揭示了通过特殊令牌干预来精细调控模型行为的新途径。
English Summary: The proposed FAST method significantly enhances sparse autoencoder performance for instruct models by aligning training with their data distribution and activation patterns, achieving superior reconstruction accuracy and feature interpretability while revealing new opportunities for model behavior control.
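
For context, the sketch below shows the generic sparse-autoencoder objective that methods like FAST build on: reconstruct model activations under an L1 sparsity penalty. FAST's actual contribution lies in how training data from the instruct model is sequenced and aligned, which is not modeled here.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Generic SAE: overcomplete ReLU features over model activations."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))  # sparse, interpretable features
        recon = self.decoder(features)
        return recon, features

def sae_loss(x, recon, features, l1_coeff: float = 1e-3):
    mse = ((recon - x) ** 2).mean()          # reconstruction quality
    sparsity = features.abs().mean()          # L1 penalty drives sparsity
    return mse + l1_coeff * sparsity

sae = SparseAutoencoder(d_model=512, d_hidden=4096)
acts = torch.randn(8, 512)                    # stand-in for LLM activations
recon, feats = sae(acts)
print(sae_loss(acts, recon, feats))
```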

Authors:Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu
Title: Synthesis by Design: Controlled Data Generation via Structural Guidance
Abstract:
Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation quality and problem complexity. To address this, we propose to extract structural information with generated problem-solving code from mathematical reasoning and guide data generation with structured solutions. Applied to MATH and GSM8K, our approach produces 39K problems with labeled intermediate steps and a 6.1K-problem benchmark of higher difficulty. Results on our benchmark show that model performance declines as reasoning length increases. Additionally, we conducted fine-tuning experiments using the proposed training data on a range of LLMs, and the results validate the effectiveness of our dataset. We hope the proposed method and dataset will contribute to future research in enhancing LLM reasoning capabilities. Our code and data are available at https://github.com/OpenCausaLab/StructuralGeneration.
中文摘要:本研究提出通过生成带标注步骤的结构化数据来增强大语言模型的数学推理能力,创建了包含3.9万问题数据集和6.1千问题的高难度基准测试,实验表明模型性能随推理链增长而下降,微调结果验证了该数据集的有效性。
English Summary: This study introduces a method to improve LLM mathematical reasoning by generating structured data with labeled steps, creating a 39K-problem dataset and a 6.1K-problem benchmark that shows performance declines with longer reasoning chains, with fine-tuning experiments confirming the dataset's effectiveness.

Authors:Yuan Chang, Ziyue Li, Hengyuan Zhang, Yuanbo Kong, Yanru Wu, Hayden Kwok-Hay So, Zhijiang Guo, Liya Zhu, Ngai Wong
Title: TreeReview: A Dynamic Tree of Questions Framework for Deep and Efficient LLM-based Scientific Peer Review
Abstract:
While Large Language Models (LLMs) have shown significant potential in assisting peer review, current methods often struggle to generate thorough and insightful reviews while maintaining efficiency. In this paper, we propose TreeReview, a novel framework that models paper review as a hierarchical and bidirectional question-answering process. TreeReview first constructs a tree of review questions by recursively decomposing high-level questions into fine-grained sub-questions and then resolves the question tree by iteratively aggregating answers from leaf to root to get the final review. Crucially, we incorporate a dynamic question expansion mechanism to enable deeper probing by generating follow-up questions when needed. We construct a benchmark derived from ICLR and NeurIPS venues to evaluate our method on full review generation and actionable feedback comments generation tasks. Experimental results of both LLM-based and human evaluation show that TreeReview outperforms strong baselines in providing comprehensive, in-depth, and expert-aligned review feedback, while reducing LLM token usage by up to 80% compared to computationally intensive approaches. Our code and benchmark dataset are available at https://github.com/YuanChang98/tree-review.
中文摘要:TreeReview框架通过将论文评审建模为分层双向问答过程,在提升评审深度和全面性的同时,比传统方法减少高达80%的计算资源消耗。
English Summary: TreeReview is a hierarchical framework that enhances peer review efficiency by decomposing it into a bidirectional question-answering process, achieving comprehensive feedback while reducing computational costs by up to 80%.
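
A conceptual sketch of the leaf-to-root resolution loop, with a stubbed `ask_llm` and a fixed-depth decomposition standing in for TreeReview's dynamic question expansion; the real prompts and expansion policy are in the authors' repository.

```python
def ask_llm(prompt: str) -> str:
    return f"<answer to: {prompt}>"  # placeholder for an actual LLM call

def decompose(question: str, depth: int) -> list[str]:
    if depth == 0:
        return []  # leaf: answer the question directly
    return [f"{question} / sub-{i}" for i in range(2)]  # stubbed expansion

def resolve(question: str, depth: int = 2) -> str:
    subs = decompose(question, depth)
    if not subs:
        return ask_llm(question)
    answers = [resolve(q, depth - 1) for q in subs]
    # Aggregate child answers into the parent answer, moving toward the root.
    return ask_llm(f"Synthesize an answer to '{question}' from: {answers}")

print(resolve("Is the methodology of this paper sound?"))
```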

Authors:Brian Gordon, Yonatan Bitton, Andreea Marzoca, Yasumasa Onoe, Xiao Wang, Daniel Cohen-Or, Idan Szpektor
Title: Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline
Abstract:
Large Vision-Language Models (VLMs) now generate highly detailed, paragraph-length image captions, yet evaluating their factual accuracy remains challenging. Current methods often miss fine-grained errors, being designed for shorter texts or lacking datasets with verified inaccuracies. We introduce DOCCI-Critique, a benchmark with 1,400 VLM-generated paragraph captions (100 images, 14 VLMs) featuring over 10,216 sentence-level human annotations of factual correctness and explanatory rationales for errors, all within paragraph context. Building on this, we develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation. We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark and strong results in CHOCOLATE claim verification. (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments (e.g., 0.98 Spearman). (3) An innovative Critic-and-Revise pipeline, where critiques from VNLI-Critique guide LLM-based corrections, achieves substantial improvements in caption factuality (e.g., a 46% gain on DetailCaps-4870). Our work offers a crucial benchmark alongside practical tools, designed to significantly elevate the standards for fine-grained evaluation and foster the improvement of VLM image understanding. Project page: https://google.github.io/unblocking-detail-caption
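Chinese: 本文构建了包含逾万条句子级人工事实标注的DOCCI-Critique基准,并训练VNLI-Critique模型用于自动事实判别与批评生成,其AutoRater排名与人类判断高度一致,批评-修订流程显著提升了描述的事实准确性。
English: This paper introduces DOCCI-Critique, a benchmark of VLM-generated paragraph captions with over 10,216 sentence-level human factuality annotations, and VNLI-Critique, a model for automated factuality classification and critique generation whose AutoRater rankings align closely with human judgments and whose Critic-and-Revise pipeline substantially improves caption factuality.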

Authors:Roman Kyslyi, Yuliia Maksymiuk, Ihor Pysmennyi
Title: Vuyko Mistral: Adapting LLMs for Low-Resource Dialectal Translation
Abstract:
In this paper we introduce the first effort to adapt large language models (LLMs) to the Ukrainian dialect (in our case Hutsul), a low-resource and morphologically complex dialect spoken in the Carpathian Highlands. We created a parallel corpus of 9852 dialect-to-standard Ukrainian sentence pairs and a dictionary of 7320 dialectal word mappings. We also addressed data shortage by proposing an advanced Retrieval-Augmented Generation (RAG) pipeline to generate synthetic parallel translation pairs, expanding the corpus with 52142 examples. We have fine-tuned multiple open-source LLMs using LoRA and evaluated them on a standard-to-dialect translation task, also comparing with few-shot GPT-4o translation. In the absence of human annotators, we adopt a multi-metric evaluation strategy combining BLEU, chrF++, TER, and LLM-based judgment (GPT-4o). The results show that even small (7B) fine-tuned models outperform zero-shot baselines such as GPT-4o across both automatic and LLM-evaluated metrics. All data, models, and code are publicly released at: https://github.com/woters/vuyko-hutsul
中文: 本文首次将大语言模型适配于低资源的胡楚尔方言,通过构建平行语料库和采用检索增强生成技术扩充数据,实验表明微调后的小型模型在翻译任务中优于GPT-4o。
English: This paper presents the first adaptation of large language models to the low-resource Hutsul dialect by creating parallel datasets and employing a RAG pipeline for data augmentation, with fine-tuned small models outperforming GPT-4o in translation tasks.
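
A minimal sketch of the LoRA fine-tuning setup the paper describes, using the Hugging Face peft library; the base checkpoint and hyperparameters below are illustrative assumptions, not the authors' exact configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Loading a 7B checkpoint requires substantial memory; name is assumed.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                  # low-rank update dimension (assumed)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only the small LoRA adapters train
```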

Authors:Mengsong Wu, Di Zhang, Yuqiang Li, Dongzhan Zhou, Wenliang Chen
Title: SELT: Self-Evaluation Tree Search for LLMs with Task Decomposition
Abstract:
While Large Language Models (LLMs) have achieved remarkable success in a wide range of applications, their performance often degrades in complex reasoning tasks. In this work, we introduce SELT (Self-Evaluation LLM Tree Search), a novel framework that leverages a modified Monte Carlo Tree Search (MCTS) to enhance LLM reasoning without relying on external reward models. By redefining the Upper Confidence Bound scoring to align with intrinsic self-evaluation capabilities of LLMs and decomposing the inference process into atomic subtasks augmented with semantic clustering at each node, SELT effectively balances exploration and exploitation, reduces redundant reasoning paths, and mitigates hallucination. We validate our approach on challenging benchmarks, including the knowledge-based MMLU and the Tool Learning dataset Seal-Tools, where SELT achieves significant improvements in answer accuracy and reasoning robustness compared to baseline methods. Notably, our framework operates without task-specific fine-tuning, demonstrating strong generalizability across diverse reasoning tasks. Relevant results and code are available at https://github.com/fairyshine/SELT .
中文: SELT是一种创新框架,通过将自评估与改进的蒙特卡洛树搜索相结合,无需任务特定微调即可提升大语言模型在复杂推理任务中的准确性和鲁棒性。
English: SELT is a novel framework that enhances LLM reasoning by integrating self-evaluation with a modified Monte Carlo Tree Search, improving accuracy and robustness across complex tasks without task-specific fine-tuning.
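
The following sketch shows one plausible reading of SELT's modified selection rule: the exploitation term of the Upper Confidence Bound comes from the LLM's own self-evaluation scores rather than an external reward model. The exact scoring rule here is a simplification.

```python
import math

def ucb(node_self_evals, parent_visits, c: float = 1.4):
    """node_self_evals: self-assigned scores in [0, 1] for this node's rollouts."""
    n = len(node_self_evals)
    if n == 0:
        return float("inf")                 # always explore unvisited nodes
    exploit = sum(node_self_evals) / n       # intrinsic self-evaluation value
    explore = c * math.sqrt(math.log(parent_visits) / n)
    return exploit + explore

print(ucb([0.8, 0.7], parent_visits=10))    # ~2.25: 0.75 + 1.4*sqrt(ln10 / 2)
```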

Authors:Mengsong Wu, YaFei Wang, Yidong Ming, Yuqi An, Yuwei Wan, Wenliang Chen, Binbin Lin, Yuqiang Li, Tong Xie, Dongzhan Zhou
Title: CheMatAgent: Enhancing LLMs for Chemistry and Materials Science through Tree-Search Based Tool Learning
Abstract:
Large language models (LLMs) have recently demonstrated promising capabilities in chemistry tasks while still facing challenges due to outdated pretraining knowledge and the difficulty of incorporating specialized chemical expertise. To address these issues, we propose an LLM-based agent that synergistically integrates 137 external chemical tools, ranging from basic information retrieval to complex reaction prediction, together with a dataset curation pipeline that generates ChemToolBench, a dataset facilitating both effective tool selection and precise parameter filling during fine-tuning and evaluation. We introduce a Hierarchical Evolutionary Monte Carlo Tree Search (HE-MCTS) framework, enabling independent optimization of tool planning and execution. By leveraging self-generated data, our approach supports step-level fine-tuning (FT) of the policy model and training task-adaptive PRM and ORM that surpass GPT-4o. Experimental evaluations demonstrate that our approach significantly improves performance in Chemistry QA and discovery tasks, offering a robust solution to integrate specialized tools with LLMs for advanced chemical applications. All datasets and code are available at https://github.com/AI4Chem/ChemistryAgent.
中文摘要:本研究提出了一种基于大语言模型的化学智能体,通过整合137种专业工具和新型分层进化蒙特卡洛树搜索框架,实现了化学问答与发现任务的性能突破,有效解决了专业工具与大模型融合的挑战。
English Summary: This study introduces a chemistry-focused LLM agent that integrates 137 specialized tools and a novel HE-MCTS framework to significantly enhance chemical QA and discovery tasks through optimized tool utilization and self-improving data generation.

Authors:Solee Im, Wonjun Lee, Jinmyeong An, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Title: DeRAGEC: Denoising Named Entity Candidates with Synthetic Rationale for ASR Error Correction
Abstract:
We present DeRAGEC, a method for improving Named Entity (NE) correction in Automatic Speech Recognition (ASR) systems. By extending the Retrieval-Augmented Generative Error Correction (RAGEC) framework, DeRAGEC employs synthetic denoising rationales to filter out noisy NE candidates before correction. By leveraging phonetic similarity and augmented definitions, it refines noisy retrieved NEs using in-context learning, requiring no additional training. Experimental results on CommonVoice and STOP datasets show significant improvements in Word Error Rate (WER) and NE hit ratio, outperforming baseline ASR and RAGEC methods. Specifically, we achieved a 28% relative reduction in WER compared to ASR without postprocessing. Our source code is publicly available at: https://github.com/solee0022/deragec
Chinese: DeRAGEC通过合成去噪原理过滤噪声命名实体候选,并利用语音相似性和上下文学习进行优化,无需额外训练即可显著降低ASR系统的词错率,实现28%的相对改进。
English: DeRAGEC enhances Named Entity correction in ASR systems by filtering noisy candidates with synthetic denoising rationales and refining them through phonetic similarity and in-context learning, achieving a 28% relative WER reduction without extra training.

Authors:Haotian Guo, Jing Han, Yongfeng Tu, Shihao Gao, Shengfan Shen, Wulong Xiang, Weihao Gan, Zixing Zhang
Title: DEBATE: A Dataset for Disentangling Textual Ambiguity in Mandarin Through Speech
Abstract:
Despite extensive research on textual and visual disambiguation, disambiguation through speech (DTS) remains underexplored. This is largely due to the lack of high-quality datasets that pair spoken sentences with richly ambiguous text. To address this gap, we present DEBATE, a unique public Chinese speech-text dataset designed to study how speech cues and patterns (pronunciation, pause, stress, and intonation) can help resolve textual ambiguity and reveal a speaker's true intent. DEBATE contains 1,001 carefully selected ambiguous utterances, each recorded by 10 native speakers, capturing diverse linguistic ambiguities and their disambiguation through speech. We detail the data collection pipeline and provide rigorous quality analysis. Additionally, we benchmark three state-of-the-art large speech and audio-language models, illustrating clear and substantial performance gaps between machine and human understanding of spoken intent. DEBATE represents the first effort of its kind and offers a foundation for building similar DTS datasets across languages and cultures. The dataset and associated code are available at: https://github.com/SmileHnu/DEBATE.
Chinese: 本文推出了首个中文语音-文本数据集DEBATE,旨在研究语音特征如何消解文本歧义,揭示了人类与机器在口语意图理解上的显著性能差距。
English: This paper introduces DEBATE, the first public Chinese speech-text dataset designed to explore how speech cues resolve textual ambiguity, revealing significant performance gaps between human and machine understanding of spoken intent.

Authors:Libo Wang
Title: Graph-of-Causal Evolution: Challenging Chain-of-Model for Reasoning
Abstract:
Each subchain in the chain-of-model (CoM) relies only on information from the previous subchain and may lose long-range dependencies because the causal mask blocks global context flow between multi-level subchains. To address this, this work proposes the graph of causal evolution (GoCE). Its core principle is to map implicit token representations into a differentiable, sparse causal adjacency matrix, then propagate causal constraints through each layer of computation via causal-masked attention and causal-MoE. By combining an intervention-consistency loss with a self-evolution gate, GoCE dynamically balances causal structure learning against adaptive updating of the transformer architecture. Experimental environments were built in sandboxes with Claude Sonnet 4, o4-mini-high, and DeepSeek R1, each using the transformer-variant architecture introduced in GoCE. The method is evaluated on publicly available datasets including CLUTRR, CLADDER, EX-FEVER, and CausalQA and compared with baseline LLMs. The results show that GoCE strengthens the transformer's ability to capture long-range causal dependencies and improves its capacity to self-evolve. It not only surpasses CoM in design principles but also offers guidance for future research on causal learning and continuous adaptive improvement.
中文: 本研究提出因果演化图(GoCE)以解决链式模型中因因果掩码导致长程依赖丢失的问题,通过将隐式表征映射为稀疏因果邻接矩阵并结合因果注意力机制,增强了Transformer捕捉长程因果依赖与自我演化的能力,其性能优于基线模型。
English: This work introduces the Graph of Causal Evolution (GoCE) to address the loss of long-range dependencies in chain-of-model architectures by mapping token representations into a sparse causal adjacency matrix and integrating causal constraints through attention mechanisms, ultimately enhancing transformers' ability to capture causal relationships and self-evolve beyond baseline models.

Authors:Haoyuan Li, Rui Zhang, Snigdha Chaturvedi
Title: Improving Fairness of Large Language Models in Multi-document Summarization
Abstract:
Fairness in multi-document summarization (MDS) is crucial for providing comprehensive views across documents with diverse social attribute values, which can significantly impact decision-making. For example, a summarization system that tends to overrepresent negative reviews of products can mislead customers into disregarding good products. Previous works measure fairness in MDS at two levels: summary-level and corpus-level. While summary-level fairness focuses on individual summaries, corpus-level fairness focuses on a corpus of summaries. Recent methods primarily focus on summary-level fairness. We propose FairPO, a preference tuning method that focuses on both summary-level and corpus-level fairness in MDS. To improve summary-level fairness, we propose to generate preference pairs by perturbing document sets. To improve corpus-level fairness, we propose fairness-aware preference tuning by dynamically adjusting the weights of preference pairs. Our experiments show that FairPO outperforms strong baselines while maintaining the critical qualities of summaries. The code is available at https://github.com/leehaoyuan/coverage_fairnes.
中文: FairPO是一种新颖的偏好调优方法,通过生成扰动偏好对并动态调整其权重,在保持摘要质量的同时显著提升了多文档摘要的摘要级和语料库级公平性,性能优于现有基线。
English: FairPO is a novel preference tuning method that enhances both summary-level and corpus-level fairness in multi-document summarization by generating perturbed preference pairs and dynamically adjusting their weights, outperforming baselines while preserving summary quality.
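
A hedged sketch of fairness-aware preference tuning as a weighted DPO-style loss: preference pairs come from perturbed document sets, and per-pair weights (dynamic in FairPO, fixed here for simplicity) steer corpus-level fairness.

```python
import torch
import torch.nn.functional as F

def weighted_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected,
                             weights, beta: float = 0.1):
    """All inputs are (N,) tensors of summed log-probs per summary."""
    margins = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    per_pair = -F.logsigmoid(beta * margins)        # standard DPO per-pair loss
    return (weights * per_pair).sum() / weights.sum()

logp_c = torch.tensor([-10.0, -12.0])
logp_r = torch.tensor([-11.0, -11.5])
ref_c, ref_r = torch.tensor([-10.5, -12.2]), torch.tensor([-10.8, -11.6])
w = torch.tensor([1.0, 2.0])  # e.g. upweight pairs from underrepresented groups
print(weighted_preference_loss(logp_c, logp_r, ref_c, ref_r, w))
```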

Authors:Vahid Azizi, Fatemeh Koochaki
Title: LlamaRec-LKG-RAG: A Single-Pass, Learnable Knowledge Graph-RAG Framework for LLM-Based Ranking
Abstract:
Recent advances in Large Language Models (LLMs) have driven their adoption in recommender systems through Retrieval-Augmented Generation (RAG) frameworks. However, existing RAG approaches predominantly rely on flat, similarity-based retrieval that fails to leverage the rich relational structure inherent in user-item interactions. We introduce LlamaRec-LKG-RAG, a novel single-pass, end-to-end trainable framework that integrates personalized knowledge graph context into LLM-based recommendation ranking. Our approach extends the LlamaRec architecture by incorporating a lightweight user preference module that dynamically identifies salient relation paths within a heterogeneous knowledge graph constructed from user behavior and item metadata. These personalized subgraphs are seamlessly integrated into prompts for a fine-tuned Llama-2 model, enabling efficient and interpretable recommendations through a unified inference step. Comprehensive experiments on ML-100K and Amazon Beauty datasets demonstrate consistent and significant improvements over LlamaRec across key ranking metrics (MRR, NDCG, Recall). LlamaRec-LKG-RAG demonstrates the critical value of structured reasoning in LLM-based recommendations and establishes a foundation for scalable, knowledge-aware personalization in next-generation recommender systems. Code is available at~\href{https://github.com/VahidAz/LlamaRec-LKG-RAG}{repository}.
中文摘要:本文提出LlamaRec-LKG-RAG框架,通过将个性化知识图谱融入基于大语言模型的推荐系统,利用结构化推理显著提升了推荐排序性能。
English Summary: The paper introduces LlamaRec-LKG-RAG, a novel framework that integrates personalized knowledge graphs into LLM-based recommender systems, demonstrating improved ranking performance through structured reasoning.

Authors:Janghyeon Yun, Sang-goo Lee
Title: SEED: Enhancing Text-to-SQL Performance and Practical Usability Through Automatic Evidence Generation
Abstract:
Text-to-SQL enables non-experts to retrieve data from databases by converting natural language queries into SQL. However, state-of-the-art text-to-SQL studies rely on the BIRD dataset, which assumes that evidence is provided along with questions. Although BIRD facilitates research advancements, it assumes that users have expertise and domain knowledge, contradicting the fundamental goal of text-to-SQL. In addition, human-generated evidence in BIRD contains defects, including missing or erroneous evidence, which affects model performance. To address this issue, we propose SEED (System for Evidence Extraction and Domain knowledge generation), an approach that automatically generates evidence to improve performance and practical usability in real-world scenarios. SEED systematically analyzes database schema, description files, and values to extract relevant information. We evaluated SEED on BIRD and Spider, demonstrating that it significantly improves SQL generation accuracy in the no-evidence scenario, and in some cases, even outperforms the setting where BIRD evidence is provided. Our results highlight that SEED-generated evidence not only bridges the gap between research and real-world deployment but also improves the adaptability and robustness of text-to-SQL models. Our code is available at https://github.com/felix01189/SEED
中文: SEED系统通过自动分析数据库结构生成证据,显著提升了无证据场景下文本转SQL的准确性,增强了模型在实际应用中的适应性和鲁棒性。
English: SEED is an automated evidence generation system that enhances text-to-SQL model performance by analyzing database components, achieving higher accuracy in no-evidence scenarios and improving real-world applicability.

Authors:Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, Shuicheng Yan
Title: G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems
Abstract:
Large language model (LLM)-powered multi-agent systems (MAS) have demonstrated cognitive and execution capabilities that far exceed those of single LLM agents, yet their capacity for self-evolution remains hampered by underdeveloped memory architectures. Upon close inspection, we are alarmed to discover that prevailing MAS memory mechanisms (1) are overly simplistic, completely disregarding the nuanced inter-agent collaboration trajectories, and (2) lack cross-trial and agent-specific customization, in stark contrast to the expressive memory developed for single agents. To bridge this gap, we introduce G-Memory, a hierarchical, agentic memory system for MAS inspired by organizational memory theory, which manages the lengthy MAS interaction via a three-tier graph hierarchy: insight, query, and interaction graphs. Upon receiving a new user query, G-Memory performs bi-directional memory traversal to retrieve both $\textit{high-level, generalizable insights}$ that enable the system to leverage cross-trial knowledge, and $\textit{fine-grained, condensed interaction trajectories}$ that compactly encode prior collaboration experiences. Upon task execution, the entire hierarchy evolves by assimilating new collaborative trajectories, nurturing the progressive evolution of agent teams. Extensive experiments across five benchmarks, three LLM backbones, and three popular MAS frameworks demonstrate that G-Memory improves success rates in embodied action and accuracy in knowledge QA by up to $20.89\%$ and $10.12\%$, respectively, without any modifications to the original frameworks. Our codes are available at https://github.com/bingreeky/GMemory.
中文: G-Memory作为一种层次化、智能的记忆系统,通过三层图结构管理多智能体系统的交互,无需修改原有框架即可显著提升任务成功率和准确性。
English: G-Memory is a hierarchical, agentic memory system designed to enhance multi-agent systems by managing interactions through a three-tier graph structure, which significantly improves task success rates and accuracy without altering existing frameworks.

Authors:Olga Kellert, Nemika Tyagi, Muhammad Imran, Nelvin Licona-Guevara, Carlos Gómez-Rodríguez
Title: Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages
Abstract:
Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Parser, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Parser achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments. Data and source code are available at https://github.com/N3mika/ParsingProject
中文:BiLingua Parser 是一种基于大语言模型的注释流程,能有效为语码转换文本生成通用依存关系标注,经专家修订后达到 95.29% 的LAS值,在低资源环境中显著优于现有解析器。
English: The BiLingua Parser, an LLM-based annotation pipeline, effectively generates Universal Dependencies annotations for code-switched text, achieving up to 95.29% LAS after expert revision and outperforming existing parsers in low-resource settings.

Authors:Hao Tang, Chengchao Shen
Title: Learning Compact Vision Tokens for Efficient Large Multimodal Models
Abstract:
Large multimodal models (LMMs) suffer significant computational challenges due to the high cost of Large Language Models (LLMs) and the quadratic complexity of processing long vision token sequences. In this paper, we explore the spatial redundancy among vision tokens and shorten the length of vision token sequences for inference acceleration. Specifically, we propose a Spatial Token Fusion (STF) method that learns compact vision tokens for shorter vision token sequences, where spatially adjacent tokens are fused into one. Meanwhile, a weight-frozen vision encoder cannot adapt well to the demands of diverse downstream vision-language tasks. To this end, we further introduce a Multi-Block Token Fusion (MBTF) module to supplement multi-granularity features for the reduced token sequence. Overall, we combine the STF and MBTF modules to balance token reduction against information preservation, thereby improving inference efficiency without sacrificing multimodal reasoning capabilities. Experimental results demonstrate that our method based on LLaVA-1.5 achieves comparable or even superior performance to the baseline on 8 popular vision-language benchmarks with only $25\%$ of the baseline's vision tokens. The source code and trained weights are available at https://github.com/visresearch/LLaVA-STF.
中文摘要:本文提出空间令牌融合方法,通过压缩视觉令牌并补充多粒度特征来降低大型多模态模型的计算成本,仅用基线25%的令牌即可实现相当甚至更优的性能。
English Summary: The paper introduces a Spatial Token Fusion method to reduce computational costs in large multimodal models by compressing vision tokens and enhancing them with multi-granularity features, achieving comparable performance with only 25% of baseline tokens.
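
To illustrate the core token-reduction step, here is a minimal fusion layer that merges each 2x2 neighborhood of a token grid into one token via a linear projection, cutting a 24x24 grid of 576 tokens down to 144 (25%). STF's learned fusion and the MBTF module are more elaborate than this sketch.

```python
import torch
import torch.nn as nn

class SpatialTokenFusion(nn.Module):
    """Fuse each 2x2 block of spatially adjacent vision tokens into one."""
    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(4 * dim, dim)   # 4 neighboring tokens -> 1 token

    def forward(self, tokens, grid: int):
        B, N, d = tokens.shape                # N == grid * grid
        x = tokens.view(B, grid, grid, d)
        # Group each 2x2 neighborhood along the channel dimension.
        x = x.view(B, grid // 2, 2, grid // 2, 2, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, (grid // 2) ** 2, 4 * d)
        return self.fuse(x)

stf = SpatialTokenFusion(dim=64)
vision_tokens = torch.randn(1, 24 * 24, 64)
print(stf(vision_tokens, grid=24).shape)      # torch.Size([1, 144, 64])
```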

Authors:Samir Abdaljalil, Hasan Kurban, Khalid Qaraqe, Erchin Serpedin
Title: Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models
Abstract:
Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.
中文:定理思维(ToTh)框架通过将推理建模为三个采用不同推理模式的智能体协作过程,将其输出构建为推理图并通过贝叶斯信念传播评估一致性,从而优于现有方法并提供可解释、逻辑严密的推理结果。
English: The Theorem-of-Thought (ToTh) framework enhances LLM reasoning by modeling it as a collaborative process among three agents using distinct inference modes, structuring their outputs into reasoning graphs evaluated for coherence via Bayesian belief propagation, which outperforms existing methods and provides interpretable, logically grounded results.

Authors:Kai Xiong, Xiao Ding, Yixin Cao, Yuxiong Yan, Li Du, Yufei Zhang, Jinglong Gao, Jiaqian Liu, Bing Qin, Ting Liu
Title: Com$^2$: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models
Abstract:
Large language models (LLMs) have mastered abundant simple and explicit commonsense knowledge through pre-training, enabling them to achieve human-like performance in simple commonsense reasoning. Nevertheless, LLMs struggle to reason with complex and implicit commonsense knowledge that is derived from simple ones (such as understanding the long-term effects of certain events), an aspect humans tend to focus on more. Existing works focus on complex tasks like math and code, while complex commonsense reasoning remains underexplored due to its uncertainty and lack of structure. To fill this gap and align with real-world concerns, we propose a benchmark Com$^2$ focusing on complex commonsense reasoning. We first incorporate causal event graphs to serve as structured complex commonsense. Then we adopt causal theory~(e.g., intervention) to modify the causal event graphs and obtain different scenarios that meet human concerns. Finally, an LLM is employed to synthesize examples with slow thinking, which is guided by the logical relationships in the modified causal graphs. Furthermore, we use detective stories to construct a more challenging subset. Experiments show that LLMs struggle in reasoning depth and breadth, while post-training and slow thinking can alleviate this. The code and data are available at https://github.com/Waste-Wood/Com2.
中文摘要:大型语言模型擅长处理简单常识推理,但在复杂隐性知识方面表现不足,为此提出的Com²基准通过结构化因果推理和慢思考方法,旨在评估并提升其在此类场景下的能力。
English Summary: Large language models excel at simple commonsense reasoning but struggle with complex, implicit scenarios, prompting the creation of the Com² benchmark to evaluate and improve their performance in this area through structured causal reasoning and slow-thinking methodologies.

Authors:LASA Team, Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, Yu Sun, Junao Shen, Chaojun Wang, Jie Tan, Deli Zhao, Tingyang Xu, Hao Zhang, Yu Rong
Title: Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in understanding common visual elements, largely due to their large-scale datasets and advanced training strategies. However, their effectiveness in medical applications remains limited due to the inherent discrepancies between data and tasks in medical scenarios and those in the general domain. Concretely, existing medical MLLMs face the following critical limitations: (1) limited coverage of medical knowledge beyond imaging, (2) heightened susceptibility to hallucinations due to suboptimal data curation processes, (3) lack of reasoning capabilities tailored for complex medical scenarios. To address these challenges, we first propose a comprehensive data curation procedure that (1) efficiently acquires rich medical knowledge data not only from medical imaging but also from extensive medical texts and general-domain data; and (2) synthesizes accurate medical captions, visual question answering (VQA), and reasoning samples. As a result, we build a multimodal dataset enriched with extensive medical knowledge. Building on the curated data, we introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities progressively. Besides, we preliminarily explore the potential of applying reinforcement learning with verifiable rewards paradigm to enhance Lingshu's medical reasoning ability. Additionally, we develop MedEvalKit, a unified evaluation framework that consolidates leading multimodal and textual medical benchmarks for standardized, fair, and efficient model assessment. We evaluate the performance of Lingshu on three fundamental medical tasks, multimodal QA, text-based QA, and medical report generation. The results show that Lingshu consistently outperforms the existing open-source multimodal models on most tasks ...
中文: 多模态大语言模型在通用视觉任务中表现出色,但在医疗应用中因知识覆盖不足和数据差异而受限,为此研发了基于丰富医疗数据训练的专用模型Lingshu,其在多项核心医疗任务中优于现有模型。
English: Multimodal Large Language Models (MLLMs) excel in general visual tasks but underperform in medical applications due to knowledge gaps and data limitations, prompting the development of Lingshu, a specialized model trained on enriched medical data that surpasses existing models in key medical tasks.

Authors:Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin'ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan
Title: A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Abstract:
Large multimodal models (LMMs) have recently gained attention due to their effectiveness to understand and generate descriptions of visual content. Most existing LMMs are in English language. While few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity is yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual Video LMM benchmark, named ViMUL-Bench, to evaluate Video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories including eight culturally diverse categories, ranging from lifestyles and festivals to foods and rituals and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long) with 8k samples that are manually verified by native language speakers. In addition, we also introduce a machine translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, that is shown to provide a better tradeoff between high-and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM along with a large-scale multilingual video training set will help ease future research in developing cultural and linguistic inclusive multilingual video LMMs. Our proposed benchmark, video LMM and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.
中文:本文提出了ViMUL-Bench多语言视频基准测试,用于评估涵盖14种语言和多元文化类别的大型多模态模型,同时开发的新型多语言视频LMM显著提升了语言包容性。
English: This paper introduces ViMUL-Bench, a multilingual video benchmark evaluating large multimodal models across 14 languages and diverse cultural categories, along with a new multilingual video LMM that improves language inclusivity.

Authors:Senqi Yang, Dongyu Zhang, Jing Ren, Ziqi Xu, Xiuzhen Zhang, Yiliao Song, Hongfei Lin, Feng Xia
Title: Cultural Bias Matters: A Cross-Cultural Benchmark Dataset and Sentiment-Enriched Model for Understanding Multimodal Metaphors
Abstract:
Metaphors are pervasive in communication, making them crucial for natural language processing (NLP). Previous research on automatic metaphor processing predominantly relies on training data consisting of English samples, which often reflect Western European or North American biases. This cultural skew can lead to an overestimation of model performance and contributions to NLP progress. However, the impact of cultural bias on metaphor processing, particularly in multimodal contexts, remains largely unexplored. To address this gap, we introduce MultiMM, a Multicultural Multimodal Metaphor dataset designed for cross-cultural studies of metaphor in Chinese and English. MultiMM consists of 8,461 text-image advertisement pairs, each accompanied by fine-grained annotations, providing a deeper understanding of multimodal metaphors beyond a single cultural domain. Additionally, we propose Sentiment-Enriched Metaphor Detection (SEMD), a baseline model that integrates sentiment embeddings to enhance metaphor comprehension across cultural backgrounds. Experimental results validate the effectiveness of SEMD on metaphor detection and sentiment analysis tasks. We hope this work increases awareness of cultural bias in NLP research and contributes to the development of fairer and more inclusive language models. Our dataset and code are available at https://github.com/DUTIR-YSQ/MultiMM.
中文摘要:本文提出MultiMM多文化多模态隐喻数据集,通过提供带标注的中英文广告对解决NLP中的文化偏见问题,并开发了情感增强检测模型,有效提升了跨文化隐喻理解能力。
English Summary: This paper introduces MultiMM, a multicultural multimodal metaphor dataset addressing cultural bias in NLP by providing Chinese and English advertisement pairs with annotations, and proposes a sentiment-enhanced detection model that demonstrates improved cross-cultural metaphor understanding.

Authors:Ziheng Qiao, Houquan Zhou, Zhenghua Li
Title: Mixture of Small and Large Models for Chinese Spelling Check
Abstract:
In the era of large language models (LLMs), the Chinese Spelling Check (CSC) task has seen various LLM methods developed, yet their performance remains unsatisfactory. In contrast, fine-tuned BERT-based models, relying on high-quality in-domain data, show excellent performance but suffer from edit pattern overfitting. This paper proposes a novel dynamic mixture approach that effectively combines the probability distributions of small models and LLMs during the beam search decoding phase, achieving a balanced enhancement of precise corrections from small models and the fluency of LLMs. This approach also eliminates the need for fine-tuning LLMs, saving significant time and resources, and facilitating domain adaptation. Comprehensive experiments demonstrate that our mixture approach significantly boosts error correction capabilities, achieving state-of-the-art results across multiple datasets. Our code is available at https://github.com/zhqiao-nlp/MSLLM.
中文: 本文提出一种动态混合方法,在集束搜索中结合小模型的精确修正与大语言模型的流畅性,无需微调即可实现最先进的中文拼写纠错效果。
English: This paper introduces a dynamic mixture approach that integrates small models' precision with LLMs' fluency during beam search, achieving state-of-the-art Chinese spelling correction without fine-tuning LLMs.
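
The mixture itself is simple to state: at each beam-search step, the two next-token distributions are interpolated. The sketch below uses a fixed weight lambda, whereas the paper's mixture weight is dynamic.

```python
import torch

def mix_distributions(p_small, p_llm, lam: float = 0.7):
    """p_small, p_llm: (vocab,) next-token probability tensors."""
    return lam * p_small + (1.0 - lam) * p_llm   # convex mix still sums to 1

p_small = torch.tensor([0.85, 0.10, 0.05])       # precise in-domain corrections
p_llm = torch.tensor([0.40, 0.50, 0.10])         # favors fluent continuations
mixed = mix_distributions(p_small, p_llm)
print(mixed, mixed.sum())                        # tensor([0.715, 0.220, 0.065]) 1.0
```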

Authors:Tianjie Ju, Yujia Chen, Hao Fei, Mong-Li Lee, Wynne Hsu, Pengzhou Cheng, Zongru Wu, Zhuosheng Zhang, Gongshen Liu
Title: On the Adaptive Psychological Persuasion of Large Language Models
Abstract:
Previous work has showcased the intriguing capabilities of Large Language Models (LLMs) in instruction-following and rhetorical fluency. However, systematic exploration of their dual capabilities to autonomously persuade and resist persuasion, particularly in contexts involving psychological rhetoric, remains unexplored. In this paper, we first evaluate four commonly adopted LLMs by tasking them to alternately act as persuaders and listeners in adversarial dialogues. Empirical results show that persuader LLMs predominantly employ repetitive strategies, leading to low success rates. Then we introduce eleven comprehensive psychological persuasion strategies, finding that explicitly instructing LLMs to adopt specific strategies such as Fluency Effect and Repetition Effect significantly improves persuasion success rates. However, no ``one-size-fits-all'' strategy proves universally effective, with performance heavily dependent on contextual counterfactuals. Motivated by these observations, we propose an adaptive framework based on direct preference optimization that trains LLMs to autonomously select optimal strategies by leveraging persuasion results from strategy-specific responses as preference pairs. Experiments on three open-source LLMs confirm that the proposed adaptive psychological persuasion method effectively enables persuader LLMs to select optimal strategies, significantly enhancing their success rates while maintaining general capabilities. Our code is available at https://github.com/KalinaEine/PsychologicalPersuasion.
中文: 本研究评估了大语言模型的说服能力,发现明确的心理策略可提高成功率,并提出一种自适应框架训练模型自主选择最优策略,显著提升了说服效果。
English: This study evaluates large language models' persuasion capabilities, finding that explicit psychological strategies improve success rates, and proposes an adaptive framework that trains models to autonomously select optimal strategies, significantly enhancing performance.

Authors:Walter Paci, Alessandro Panunzi, Sandro Pezzelle
Title: They want to pretend not to understand: The Limits of Current LLMs in Interpreting Implicit Content of Political Discourse
Abstract:
Implicit content plays a crucial role in political discourse, where speakers systematically employ pragmatic strategies such as implicatures and presuppositions to influence their audiences. Large Language Models (LLMs) have demonstrated strong performance in tasks requiring complex semantic and pragmatic understanding, highlighting their potential for detecting and explaining the meaning of implicit content. However, their ability to do this within political discourse remains largely underexplored. Leveraging, for the first time, the large IMPAQTS corpus, which comprises Italian political speeches with the annotation of manipulative implicit content, we propose methods to test the effectiveness of LLMs in this challenging problem. Through a multiple-choice task and an open-ended generation task, we demonstrate that all tested models struggle to interpret presuppositions and implicatures. We conclude that current LLMs lack the key pragmatic capabilities necessary for accurately interpreting highly implicit language, such as that found in political discourse. At the same time, we highlight promising trends and future directions for enhancing model performance. We release our data and code at https://github.com/WalterPaci/IMPAQTS-PID
中文摘要:当前大型语言模型尚缺乏准确解读政治话语中预设和言外之意等隐性内容的关键语用能力,但研究显示出未来提升模型性能的积极趋势。
English Summary: Large Language Models currently lack the pragmatic capabilities to accurately interpret implicit content like presuppositions and implicatures in political discourse, though promising trends suggest potential for future improvement.

Authors:Chunyuan Deng, Ruidi Chang, Hanjie Chen
Title: Learning Distribution-Wise Control in Representation Space for Language Models
Abstract:
Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enabling finer-grained control over language models. The code is at: \href{https://github.com/chili-lab/D-Intervention}{https://github.com/chili-lab/D-Intervention}.
中文: 本研究提出语言模型的分布级干预方法,通过扩展概念子空间的调控范围实现更精细的行为控制,在常识推理与算术推理任务中展现出比点态干预更强的可控性与鲁棒性。
English: This study introduces distribution-wise interventions for language models, which extend beyond pointwise control to learn transformations across broader concept subspaces, demonstrating superior controllability and robustness in commonsense and arithmetic reasoning benchmarks compared to traditional methods.
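
One way to read the distribution-level extension: instead of adding a deterministic (pointwise) edit to a hidden state, sample the edit from a learned Gaussian region of the concept subspace. The sketch below is a simplification under that assumption, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class DistributionIntervention(nn.Module):
    """Sample interventions from a learned region rather than a single point."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Linear(dim, dim)                     # center of the edit
        self.log_sigma = nn.Parameter(torch.zeros(dim))   # learned spread

    def forward(self, h):
        eps = torch.randn_like(h)             # sample the surrounding region
        return h + self.mu(h) + eps * self.log_sigma.exp()

intervene = DistributionIntervention(dim=16)
hidden = torch.randn(4, 16)                   # stand-in early-layer states
print(intervene(hidden).shape)                # torch.Size([4, 16])
```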

Authors:Nikhita Vedula, Dushyanta Dhyani, Laleh Jalali, Boris Oreshkin, Mohsen Bayati, Shervin Malmasi
Title: Quantile Regression with Large Language Models for Price Prediction
Abstract:
Large Language Models (LLMs) have shown promise in structured prediction tasks, including regression, but existing approaches primarily focus on point estimates and lack systematic comparison across different methods. We investigate probabilistic regression using LLMs for unstructured inputs, addressing challenging text-to-distribution prediction tasks such as price estimation where both nuanced text understanding and uncertainty quantification are critical. We propose a novel quantile regression approach that enables LLMs to produce full predictive distributions, improving upon traditional point estimates. Through extensive experiments across three diverse price prediction datasets, we demonstrate that a Mistral-7B model fine-tuned with quantile heads significantly outperforms traditional approaches for both point and distributional estimations, as measured by three established metrics each for prediction accuracy and distributional calibration. Our systematic comparison of LLM approaches, model architectures, training approaches, and data scaling reveals that Mistral-7B consistently outperforms encoder architectures, embedding-based methods, and few-shot learning methods. Our experiments also reveal the effectiveness of LLM-assisted label correction in achieving human-level accuracy without systematic bias. Our curated datasets are made available at https://github.com/vnik18/llm-price-quantile-reg/ to support future research.
中文摘要:本研究提出了一种利用大语言模型的新型分位数回归方法,用于文本到分布预测任务(如价格估计),通过在多数据集和指标上的系统比较,证明了该方法在预测准确性和分布校准方面均优于传统方法。
English Summary: This study introduces a novel quantile regression method using Large Language Models (LLMs) to generate predictive distributions for text-to-distribution tasks like price estimation, demonstrating superior performance over traditional approaches through systematic comparisons across multiple datasets and metrics.
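
Quantile heads are typically trained with the pinball (quantile) loss, whose asymmetric penalty makes each head track a different quantile of the price distribution; a minimal sketch follows (the paper's exact head design may differ).

```python
import torch

def pinball_loss(pred, target, quantile: float):
    """Asymmetric loss: under-prediction is cheap for low quantiles, and
    over-prediction is cheap for high quantiles."""
    diff = target - pred
    return torch.maximum(quantile * diff, (quantile - 1.0) * diff).mean()

target = torch.tensor([100.0])                # true price
for q, pred in [(0.1, 80.0), (0.5, 95.0), (0.9, 130.0)]:
    print(q, pinball_loss(torch.tensor([pred]), target, q).item())
# 0.1 -> 2.0, 0.5 -> 2.5, 0.9 -> 3.0: each head settles at its own quantile.
```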

Authors:Jacqueline He, Howard Yen, Margaret Li, Shuyue Stella Li, Zhiyuan Zeng, Weijia Shi, Yulia Tsvetkov, Danqi Chen, Pang Wei Koh, Luke Zettlemoyer
Title: Precise Information Control in Long-Form Text Generation
Abstract:
A central challenge in modern language models (LMs) is intrinsic hallucination: the generation of information that is plausible but unsubstantiated relative to input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, known as verifiable claims, without adding any unsupported ones. For comprehensiveness, PIC includes a full setting that tests a model's ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still intrinsically hallucinate in over 70% of outputs. To alleviate this lack of faithfulness, we introduce a post-training framework, using a weakly supervised preference data construction method, to train an 8B PIC-LM with stronger PIC ability--improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace verification task, underscoring the potential of precisely grounded generation.
中文: 该研究提出精确信息控制(PIC)任务以解决语言模型的幻觉问题,通过要求模型仅基于给定声明生成内容,发现即使先进模型仍有超过70%的幻觉率,并开发了后训练框架将模型性能提升至91% F1分数,显著增强了事实生成的可信度。
English: The study introduces Precise Information Control (PIC) to address language model hallucinations by generating outputs strictly based on provided claims, revealing that even advanced models hallucinate over 70% of the time, and proposes a post-training framework that significantly improves faithfulness in factual generation tasks.
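
The full-setting PIC F1 can be pictured as claim-level precision and recall; below is a minimal sketch, assuming an external entailment judge has already produced boolean support labels (the judge itself is out of scope here).

```python
# Sketch of claim-level F1 for the full PIC setting, given judge labels.
def pic_f1(input_claims_covered: list[bool], output_claims_supported: list[bool]) -> float:
    """Recall: fraction of input claims the output includes.
    Precision: fraction of output claims grounded in the input set."""
    recall = sum(input_claims_covered) / max(len(input_claims_covered), 1)
    precision = sum(output_claims_supported) / max(len(output_claims_supported), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(pic_f1([True, True, False], [True, True, True, False]))  # ~0.706
```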

Authors:Ho Yin 'Sam' Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao 'Kenneth' Huang
Title: LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
Abstract:
Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost always need to revise generic AI-generated captions to match their writing style and the domain's style, highlighting the need for personalization. Despite language models' personalization (LaMP) advances, these technologies often focus on text-only settings and rarely address scenarios where both inputs and profiles are multimodal. This paper introduces LaMP-Cap, a dataset for personalized figure caption generation with multimodal figure profiles. For each target figure, LaMP-Cap provides not only the needed inputs, such as figure images, but also up to three other figures from the same document--each with its image, caption, and figure-mentioning paragraphs--as a profile to characterize the context. Experiments with four LLMs show that using profile information consistently helps generate captions closer to the original author-written ones. Ablation studies reveal that images in the profile are more helpful than figure-mentioning paragraphs, highlighting the advantage of using multimodal profiles over text-only ones.
Chinese: 本文介绍了用于个性化图表标题生成的多模态数据集LaMP-Cap,实验表明,结合多模态背景信息(特别是图像)能显著提升AI生成标题与作者原创标题的契合度,优于纯文本方法。
English: The paper introduces LaMP-Cap, a multimodal dataset for personalized figure caption generation, demonstrating through experiments that incorporating multimodal profile information, especially images, significantly improves the alignment of AI-generated captions with author-written ones compared to text-only approaches.

Authors:Dor Tsur, Carol Xuan Long, Claudio Mayrink Verdun, Hsiang Hsu, Chen-Fu Chen, Haim Permuter, Sajani Vithana, Flavio P. Calmon
Title: HeavyWater and SimplexWater: Watermarking Low-Entropy Text Distributions
Abstract:
Large language model (LLM) watermarks enable authentication of text provenance, curb misuse of machine-generated text, and promote trust in AI systems. Current watermarks operate by changing the next-token predictions output by an LLM. The updated (i.e., watermarked) predictions depend on random side information produced, for example, by hashing previously generated tokens. LLM watermarking is particularly challenging in low-entropy generation tasks - such as coding - where next-token predictions are near-deterministic. In this paper, we propose an optimization framework for watermark design. Our goal is to understand how to most effectively use random side information in order to maximize the likelihood of watermark detection and minimize the distortion of generated text. Our analysis informs the design of two new watermarks: HeavyWater and SimplexWater. Both watermarks are tunable, gracefully trading off detection accuracy against text distortion. They can also be applied to any LLM and are agnostic to side information generation. We examine the performance of HeavyWater and SimplexWater through several benchmarks, demonstrating that they can achieve high watermark detection accuracy with minimal compromise of text generation quality, particularly in the low-entropy regime. Our theoretical analysis also reveals surprising new connections between LLM watermarking and coding theory. The code implementation can be found at https://github.com/DorTsur/HeavyWater_SimplexWater
中文: 本文提出了一种可优化的水印设计框架,开发了HeavyWater和SimplexWater两种水印技术,能在保持文本质量的同时实现高检测精度,适用于各类语言模型和辅助信息生成方式。
English: This paper introduces an optimization framework for designing tunable LLM watermarks, HeavyWater and SimplexWater, which effectively balance detection accuracy and text quality while being applicable to any language model and side information generation method.
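
For context on the general mechanism the abstract describes (side information hashed from previous tokens perturbs next-token predictions), here is the standard green-list watermark in sketch form; this is a generic baseline for illustration, not HeavyWater or SimplexWater themselves.

```python
# Generic green-list watermark sketch: hash of previous token seeds a "green"
# vocabulary subset whose logits are boosted; detection counts green hits.
import hashlib
import numpy as np

def green_mask(prev_token: int, vocab_size: int, gamma: float = 0.5) -> np.ndarray:
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % 2**32
    rng = np.random.default_rng(seed)
    return rng.random(vocab_size) < gamma  # side information: the green subset

def watermark_logits(logits: np.ndarray, prev_token: int, delta: float = 2.0) -> np.ndarray:
    return logits + delta * green_mask(prev_token, logits.shape[0])

def detect_z(tokens: list[int], vocab_size: int, gamma: float = 0.5) -> float:
    """z-score of the observed green-token count under the null hypothesis."""
    hits = sum(green_mask(p, vocab_size, gamma)[t] for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - gamma * n) / (gamma * (1 - gamma) * n) ** 0.5

vocab = 32000
toks = [11, 523, 908, 77, 4100, 88]
print(round(detect_z(toks, vocab), 2))
```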

Authors:Hengzhi Li, Brendon Jiang, Alexander Naehu, Regan Song, Justin Zhang, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang
Title: PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
Abstract:
Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions, puzzlehunts require models to discover the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite recent progress in foundation models, their performance on such open-ended settings remains largely untested. In this paper, we introduce PuzzleWorld, a large-scale benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-2% final answer accuracy, with the best model solving only 14% of puzzles and reaching 40% stepwise accuracy. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces improves stepwise reasoning from 4% to 11%, while training on final answers alone degrades performance to near zero. Our error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.
中文: PuzzleWorld是一个包含667个解谜式问题的新基准,用于评估开放式多模态推理,当前最优模型仅解决14%的谜题,并暴露出短视推理和缺乏草图能力等局限。
English: PuzzleWorld is a new benchmark of 667 puzzlehunt problems that tests open-ended multimodal reasoning, where current top models struggle with only 14% solved and reveal limitations like myopic reasoning and lack of sketching skills.

Authors:Jinyu Yang, Cheng Yang, Shanyuan Cui, Zeyuan Guo, Liangwei Yang, Muhan Zhang, Zhiqiang Zhang, Chuan Shi
Title: Masked Language Models are Good Heterogeneous Graph Generalizers
Abstract:
Heterogeneous graph neural networks (HGNNs) excel at capturing structural and semantic information in heterogeneous graphs (HGs), while struggling to generalize across domains and tasks. With the rapid advancement of large language models (LLMs), a recent study explored the integration of HGNNs with LLMs for generalizable heterogeneous graph learning. However, this approach typically encodes structural information as HG tokens using HGNNs, and disparities in embedding spaces between HGNNs and LLMs have been shown to bias the LLM's comprehension of HGs. Moreover, since these HG tokens are often derived from node-level tasks, the model's ability to generalize across tasks remains limited. To this end, we propose a simple yet effective Masked Language Modeling-based method, called MLM4HG. MLM4HG introduces metapath-based textual sequences instead of HG tokens to extract structural and semantic information inherent in HGs, and designs customized textual templates to unify different graph tasks into a coherent cloze-style 'mask' token prediction paradigm. Specifically, MLM4HG first converts HGs from various domains to texts based on metapaths, and subsequently combines them with the unified task texts to form an HG-based corpus. Moreover, the corpus is fed into a pretrained LM for fine-tuning with a constrained target vocabulary, enabling the fine-tuned LM to generalize to unseen target HGs. Extensive cross-domain and multi-task experiments on four real-world datasets demonstrate the superior generalization performance of MLM4HG over state-of-the-art methods in both few-shot and zero-shot scenarios. Our code is available at https://github.com/BUPT-GAMMA/MLM4HG.
中文摘要:MLM4HG是一种创新方法,通过将异质图转换为基于元路径的文本序列,并利用填空式模板统一不同图任务,在语言模型微调后实现了优异的跨领域泛化性能。
English Summary: MLM4HG is a novel method that converts heterogeneous graphs into metapath-based textual sequences and unifies graph tasks through cloze-style templates, enabling superior cross-domain generalization when fine-tuned with language models.
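
A toy sketch of the metapath-to-text and cloze-template idea; the node types, templates, and [MASK] wording below are illustrative placeholders rather than MLM4HG's actual templates.

```python
# Sketch: verbalize a metapath instance, then wrap it in a cloze-style task.
def metapath_to_text(path: list[tuple[str, str]]) -> str:
    """path: [(node_type, node_name), ...] along a metapath such as
    Author-Paper-Venue."""
    return " -> ".join(f"{t} '{n}'" for t, n in path)

def cloze_task(path_text: str, task: str) -> str:
    templates = {
        "node_classification": f"{path_text}. The category of the first node is [MASK].",
        "link_prediction": f"{path_text}. These two nodes are [MASK] connected.",
    }
    return templates[task]

p = [("author", "A. Turing"), ("paper", "Computing Machinery"), ("venue", "Mind")]
print(cloze_task(metapath_to_text(p), "node_classification"))
```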

Authors:David Wan, Han Wang, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Title: CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval
Abstract:
Online video web content is richly multimodal: a single video blends vision, speech, ambient audio, and on-screen text. Retrieval systems typically treat these modalities as independent retrieval sources, which can lead to noisy and subpar retrieval. We explore multimodal video content retrieval, where relevance can be scored from one particular modality or jointly across multiple modalities simultaneously. Consequently, an effective retriever must dynamically choose which modality (or set of modalities) best addresses the query. We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes 4 modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR jointly encodes all modalities with a unified multimodal backbone for improved contextualization and is trained to enhance dynamic modality selection via two key innovations. First, given the lack of training data for multimodal retrieval, we introduce MultiVENT 2.0++, a large-scale synthetic training dataset built on MultiVENT 2.0 (event-centric videos in various languages paired with queries) with modality-targeted queries. Next, we propose a modality-aware loss that jointly trains according to a standard contrastive objective alongside an objective for learning correct modality usage. On the test sets of MultiVENT 2.0++ and MSRVTT, conventional aggregation strategies, such as averaging similarities for baseline retrievers, degrade performance by introducing noise from irrelevant modalities. In contrast, CLaMR consistently outperforms existing retrievers: on MultiVENT 2.0++, CLaMR improves nDCG@10 by 25.6 over the best single-modality retriever and by 35.4 over the best multi-modality retriever. We illustrate CLaMR's downstream utility on long-video QA, retrieving relevant frames and obtaining a 3.50% boost over LanguageBind on Video-MME and 1.42% over dense sampling on LongVideoBench.
中文: 本研究提出CLaMR多模态视频检索系统,通过动态选择相关模态并联合编码四种内容类型,结合合成数据集和模态感知训练方法,显著超越了现有检索模型的性能表现。
English: The study introduces CLaMR, a multimodal video retriever that dynamically selects relevant modalities and jointly encodes four types of content, achieving significant performance improvements over existing methods by using a synthetic dataset and a modality-aware training approach.
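
Late interaction here refers to ColBERT-style MaxSim scoring over per-token embeddings; below is a simplified sketch, assuming token embeddings from the four modalities are already concatenated into one document matrix (CLaMR's trained encoder is abstracted away).

```python
# Sketch of late-interaction (MaxSim) scoring over pooled multimodal tokens.
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """query_vecs: (Q, d); doc_vecs: (T, d) concatenating tokens from video
    frames, transcribed speech, on-screen text, and metadata."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (Q, T) cosine similarities
    return float(sim.max(axis=1).sum())  # each query token takes its best match

rng = np.random.default_rng(0)
print(maxsim_score(rng.normal(size=(5, 64)), rng.normal(size=(40, 64))))
```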

Authors:Maor Ashkenazi, Ofir Brenner, Tal Furman Shohet, Eran Treister
Title: Zero-Shot Detection of LLM-Generated Code via Approximated Task Conditioning
Abstract:
Detecting Large Language Model (LLM)-generated code is a growing challenge with implications for security, intellectual property, and academic integrity. We investigate the role of conditional probability distributions in improving zero-shot LLM-generated code detection, when considering both the code and the corresponding task prompt that generated it. Our key insight is that when evaluating the probability distribution of code tokens using an LLM, there is little difference between LLM-generated and human-written code. However, conditioning on the task reveals notable differences. This contrasts with natural language text, where differences exist even in the unconditional distributions. Leveraging this, we propose a novel zero-shot detection approach that approximates the original task used to generate a given code snippet and then evaluates token-level entropy under the approximated task conditioning (ATC). We further provide a mathematical intuition, contextualizing our method relative to previous approaches. ATC requires neither access to the generator LLM nor the original task prompts, making it practical for real-world applications. To the best of our knowledge, it achieves state-of-the-art results across benchmarks and generalizes across programming languages, including Python, CPP, and Java. Our findings highlight the importance of task-level conditioning for LLM-generated code detection. The supplementary materials and code are available at https://github.com/maorash/ATC, including the dataset gathering implementation, to foster further research in this area.
中文: 本研究提出了一种新颖的零样本检测方法,通过评估近似任务条件下的标记熵来区分LLM生成的代码与人工编写的代码,该方法无需访问生成模型或原始提示,就在多种编程语言中实现了最优性能。
English: This study introduces a novel zero-shot detection method that leverages task-level conditioning to distinguish LLM-generated code from human-written code by evaluating token entropy under approximated task conditions, achieving state-of-the-art performance across multiple programming languages without requiring access to the generator model or original prompts.
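
A rough sketch of the scoring step, using GPT-2 purely as a stand-in scorer: approximate the task with a prompt, then measure mean token-level entropy of the code under that conditioning. Model choice, prompting, and any decision threshold here are assumptions, not the paper's.

```python
# Sketch: mean predictive entropy of code tokens conditioned on a task prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_conditional_entropy(task: str, code: str) -> float:
    prompt_ids = tok(task, return_tensors="pt").input_ids
    code_ids = tok(code, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, code_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits[0]
    # distributions at the positions that predict each code token
    code_logits = logits[prompt_ids.shape[1] - 1 : -1]
    probs = torch.softmax(code_logits, dim=-1)
    ent = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)
    return ent.mean().item()

task = "Write a Python function that returns the factorial of n."
code = "def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\n"
score = mean_conditional_entropy(task, code)  # lower may indicate LLM-generated
```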

Authors:Zeqi Zhou, Fang Wu, Shayan Talaei, Haokai Zhao, Cheng Meixin, Tinson Xu, Amin Saberi, Yejin Choi
Title: When to Trust Context: Self-Reflective Debates for Context Reliability
Abstract:
Large language models frequently encounter conflicts between their parametric knowledge and contextual input, often resulting in factual inconsistencies or hallucinations. We propose Self-Reflective Debate for Contextual Reliability (SR-DCR), a lightweight framework that integrates token-level self-confidence with an asymmetric multi-agent debate to adjudicate such conflicts. A critic, deprived of context, challenges a defender who argues from the given passage; a judge model evaluates the debate and determines the context's reliability. The final answer is selected by combining the verdict with model confidence. Experiments on the ClashEval benchmark demonstrate that SR-DCR consistently enhances robustness to misleading context while maintaining accuracy on trustworthy inputs, outperforming both classical debate and confidence-only baselines with minimal computational overhead. The code is available at https://github.com/smiles724/Self-Reflective-Debates.
中文: SR-DCR框架通过多智能体辩论机制解决参数知识与上下文输入的冲突,在保持准确性的同时显著提升对误导性语境的鲁棒性,且计算开销极低。
English: The SR-DCR framework uses a multi-agent debate process to resolve conflicts between parametric knowledge and contextual input, improving robustness against misleading contexts while maintaining accuracy with minimal computational cost.
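
A control-flow sketch of the debate protocol, assuming a hypothetical text-in/text-out `llm` callable and paraphrased role prompts; the actual prompts, confidence estimation, and judge model live in the authors' repo.

```python
# Sketch of the asymmetric debate: context-free critic vs. context-bound
# defender, adjudicated by a judge, with a parametric-confidence fallback.
def sr_dcr(llm, question: str, context: str, parametric_conf: float,
           threshold: float = 0.8, rounds: int = 2) -> str:
    critic_view = llm(f"Answer from memory only; ignore any context: {question}")
    defender_view = llm(f"Answer using ONLY this passage:\n{context}\n\nQ: {question}")
    transcript = []
    for _ in range(rounds):
        transcript.append(llm(f"As critic, challenge: {defender_view}"))
        transcript.append(llm(f"As defender of the passage, rebut: {transcript[-1]}"))
    verdict = llm("As judge, is the passage reliable for this question? "
                  f"Debate:\n{chr(10).join(transcript)}\nAnswer 'yes' or 'no'.")
    if "yes" in verdict.lower():
        return defender_view  # trust the context
    # otherwise fall back on the parametric answer only if the model is confident
    return critic_view if parametric_conf >= threshold else defender_view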

Authors:Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li
Title: AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search
Abstract:
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains. However, designing high-performing agentic systems remains challenging. Existing agent search methods suffer from three major limitations: (1) an emphasis on optimizing agentic workflows while under-utilizing proven human-designed components such as memory, planning, and tool use; (2) high evaluation costs, as each newly generated agent must be fully evaluated on benchmarks; and (3) inefficient search in a large search space. In this work, we introduce a comprehensive framework to address these challenges. First, we propose a hierarchical search space that jointly models agentic workflow and composable functional components, enabling richer agentic system designs. Building on this structured design space, we introduce a predictive value model that estimates agent performance given an agentic system and a task description, allowing for efficient, low-cost evaluation during the search process. Finally, we present a hierarchical Monte Carlo Tree Search (MCTS) strategy informed by uncertainty to guide the search. Experiments on seven benchmarks, covering embodied, math, web, tool, and game tasks, show that our method achieves an average performance gain of 8.34% over state-of-the-art baselines and exhibits faster search progress with steeper improvement trajectories. Code repo is available at https://github.com/Ericccc02/AgentSwift.
中文: 本文提出一个综合框架,通过整合分层搜索空间、预测性能模型和不确定性引导的MCTS,显著提升LLM智能体性能,在七大基准测试中平均提升8.34%。
English: This paper introduces a comprehensive framework that enhances LLM agent performance by integrating hierarchical search spaces, predictive value models, and uncertainty-guided MCTS, achieving an 8.34% average improvement across seven benchmarks.
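
The low-cost evaluation idea can be pictured as bandit-style selection over candidate designs, ranking by a predicted value plus an uncertainty bonus; the `value_model`, design tuples, and UCB form below are illustrative simplifications of the paper's hierarchical MCTS.

```python
# Sketch: rank candidate agent designs by predicted value + uncertainty bonus
# instead of fully benchmarking each one.
import math, random

def ucb_select(candidates, value_model, counts, total=1, c=1.4):
    def score(a):
        bonus = c * math.sqrt(math.log(total + 1) / (counts.get(a, 0) + 1))
        return value_model(a) + bonus
    return max(candidates, key=score)

# Toy: candidate "agent systems" are (workflow, memory, planner) triples.
designs = [("react", "none", "greedy"), ("react", "episodic", "mcts"),
           ("plan-act", "episodic", "greedy")]
value_model = lambda a: {"none": 0.4, "episodic": 0.6}[a[1]] + random.random() * 0.05
print(ucb_select(designs, value_model, counts={}))
```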

Authors:Haoke Zhang, Xiaobo Liang, Cunxiang Wang, Juntao Li, Min Zhang
Title: Unlocking Recursive Thinking of LLMs: Alignment via Refinement
Abstract:
The OpenAI o1-series models have demonstrated that leveraging long-form Chain of Thought (CoT) can substantially enhance performance. However, the recursive thinking capabilities of Large Language Models (LLMs) remain limited, particularly in the absence of expert-curated data for distillation. In this paper, we propose AvR (Alignment via Refinement), a novel method aimed at unlocking the potential of LLMs for recursive reasoning through long-form CoT. AvR introduces a refinement process that integrates criticism and improvement actions, guided by differentiable learning techniques to optimize refinement-aware rewards. As a result, the synthesized multi-round data can be organized as a long refinement thought, further enabling test-time scaling. Experimental results show that AvR significantly outperforms conventional preference optimization methods. Notably, with only 3k synthetic samples, our method boosts the performance of the LLaMA-3-8B-Instruct model by over 20% in win rate on AlpacaEval 2.0. Our code is available at GitHub (https://github.com/Banner-Z/AvR.git).
中文摘要:AvR方法通过引入结合可微分学习的优化过程,利用长链思维增强大语言模型的递归推理能力,仅用少量合成数据就显著超越传统方法,使LLaMA-3-8B-Instruct模型在AlpacaEval 2.0上的胜率提升超20%。
English Summary: The AvR method enhances LLMs' recursive reasoning through long-form Chain of Thought by integrating refinement processes with differentiable learning, significantly outperforming traditional methods and boosting LLaMA-3-8B-Instruct's performance by over 20% with minimal data.

Authors:Jana Straková, Milan Straka
Title: NameTag 3: A Tool and a Service for Multilingual/Multitagset NER
Abstract:
We introduce NameTag 3, an open-source tool and cloud-based web service for multilingual, multidataset, and multitagset named entity recognition (NER), supporting both flat and nested entities. NameTag 3 achieves state-of-the-art results on 21 test datasets in 15 languages and remains competitive on the rest, even against larger models. It is available as a command-line tool and as a cloud-based service, enabling use without local installation. NameTag 3 web service currently provides flat NER for 17 languages, trained on 21 corpora and three NE tagsets, all powered by a single 355M-parameter fine-tuned model; and nested NER for Czech, powered by a 126M fine-tuned model. The source code is licensed under open-source MPL 2.0, while the models are distributed under non-commercial CC BY-NC-SA 4.0. Documentation is available at https://ufal.mff.cuni.cz/nametag, source code at https://github.com/ufal/nametag3, and trained models via https://lindat.cz. The REST service and the web application can be found at https://lindat.mff.cuni.cz/services/nametag/. A demonstration video is available at https://www.youtube.com/watch?v=-gaGnP0IV8A.
Chinese: NameTag 3 是一款开源工具及云端服务,支持多语言命名实体识别,在 15 种语言的 21 个数据集上达到顶尖性能,并通过精调模型提供扁平与嵌套实体识别功能。
English: NameTag 3 is an open-source tool and cloud service for multilingual named entity recognition, achieving top performance across 21 datasets in 15 languages and offering both flat and nested entity support through fine-tuned models.

Authors:Xinjie Zhang, Wenxuan Wang, Qin Jin
Title: IntentionESC: An Intention-Centered Framework for Enhancing Emotional Support in Dialogue Systems
Abstract:
In emotional support conversations, unclear intentions can lead supporters to employ inappropriate strategies, inadvertently imposing their expectations or solutions on the seeker. Clearly defined intentions are essential for guiding both the supporter's motivations and the overall emotional support process. In this paper, we propose the Intention-centered Emotional Support Conversation (IntentionESC) framework, which defines the possible intentions of supporters in emotional support conversations, identifies key emotional state aspects for inferring these intentions, and maps them to appropriate support strategies. While Large Language Models (LLMs) excel at text generation, they fundamentally operate as probabilistic models trained on extensive datasets, lacking a true understanding of human thought processes and intentions. To address this limitation, we introduce the Intention Centric Chain-of-Thought (ICECoT) mechanism. ICECoT enables LLMs to mimic human reasoning by analyzing emotional states, inferring intentions, and selecting suitable support strategies, thereby generating more effective emotional support responses. To train the model with ICECoT and integrate expert knowledge, we design an automated annotation pipeline that produces high-quality training data. Furthermore, we develop a comprehensive evaluation scheme to assess emotional support efficacy and conduct extensive experiments to validate our framework. Our data and code are available at https://github.com/43zxj/IntentionESC_ICECoT.
中文摘要:本文提出的IntentionESC框架和ICECoT机制通过明确支持者意图并将其与适当策略关联,提升了情感支持对话的效果,并通过自动化标注和综合评估验证了其有效性。
English Summary: The IntentionESC framework and ICECoT mechanism are introduced to enhance emotional support in conversations by clarifying supporter intentions and linking them to appropriate strategies, with automated annotation and comprehensive evaluation validating their effectiveness.

Authors:Jie Cao, Tianwei Lin, Hongyang He, Rolan Yan, Wenqiao Zhang, Juncheng Li, Dongping Zhang, Siliang Tang, Yueting Zhuang
Title: MoA: Heterogeneous Mixture of Adapters for Parameter-Efficient Fine-Tuning of Large Language Models
Abstract:
Recent studies integrate Low-Rank Adaptation (LoRA) and Mixture-of-Experts (MoE) to further enhance the performance of parameter-efficient fine-tuning (PEFT) methods in Large Language Model (LLM) applications. Existing methods employ homogeneous MoE-LoRA architectures composed of LoRA experts with either similar or identical structures and capacities. However, these approaches often suffer from representation collapse and expert load imbalance, which negatively impact the potential of LLMs. To address these challenges, we propose a heterogeneous Mixture-of-Adapters (MoA) approach. This method dynamically integrates PEFT adapter experts with diverse structures, leveraging their complementary representational capabilities to foster expert specialization, thereby enhancing the effective transfer of pre-trained knowledge to downstream tasks. MoA supports two variants: (i) Soft MoA achieves fine-grained integration by performing a weighted fusion of all expert outputs; (ii) Sparse MoA activates adapter experts sparsely based on their contribution, achieving this with negligible performance degradation. Experimental results demonstrate that heterogeneous MoA outperforms homogeneous MoE-LoRA methods in both performance and parameter efficiency. Our project is available at https://github.com/DCDmllm/MoA.
中文: 提出的异构混合适配器方法通过动态整合多样化适配器专家,克服了同构MoE-LoRA架构的局限性,在大语言模型微调中实现了性能与参数效率的双重提升。
English: The proposed heterogeneous Mixture-of-Adapters (MoA) method overcomes limitations of homogeneous MoE-LoRA architectures by dynamically integrating diverse adapter experts, enhancing both performance and parameter efficiency in large language model fine-tuning.
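
A minimal Soft MoA layer in sketch form: a router softly weights structurally different adapter experts and fuses their outputs into the residual stream. The expert choices, ranks, and shapes below are assumptions for illustration, not the paper's configuration.

```python
# Sketch: heterogeneous Soft MoA with a weighted fusion of expert outputs.
import torch
import torch.nn as nn

class SoftMoA(nn.Module):
    def __init__(self, d: int, rank: int = 8):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d, rank, bias=False),
                          nn.Linear(rank, d, bias=False)),        # LoRA-style
            nn.Sequential(nn.Linear(d, d // 4), nn.ReLU(),
                          nn.Linear(d // 4, d)),                  # bottleneck adapter
            nn.Linear(d, d, bias=False),                          # full linear adapter
        ])
        self.router = nn.Linear(d, len(self.experts))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.router(h), dim=-1)                 # (..., n_experts)
        outs = torch.stack([e(h) for e in self.experts], dim=-1)  # (..., d, n)
        return h + (outs * w.unsqueeze(-2)).sum(-1)               # weighted fusion

h = torch.randn(2, 10, 64)
print(SoftMoA(64)(h).shape)  # torch.Size([2, 10, 64])
```

Sparse MoA would replace the softmax fusion with a top-k selection over the router weights, activating only the most contributing experts.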

Authors:Xiaofei Xu, Xiuzhen Zhang, Ke Deng
Title: Generating Grounded Responses to Counter Misinformation via Learning Efficient Fine-Grained Critiques
Abstract:
Fake news and misinformation pose a significant threat to society, making efficient mitigation essential. However, manual fact-checking is costly and lacks scalability. Large Language Models (LLMs) offer promise in automating counter-response generation to mitigate misinformation, but a critical challenge lies in their tendency to hallucinate non-factual information. Existing models mainly rely on LLM self-feedback to reduce hallucination, but this approach is computationally expensive. In this paper, we propose MisMitiFact, Misinformation Mitigation grounded in Facts, an efficient framework for generating fact-grounded counter-responses at scale. MisMitiFact generates simple critique feedback to refine LLM outputs, ensuring responses are grounded in evidence. We develop lightweight, fine-grained critique models trained on data sourced from readily available fact-checking sites to identify and correct errors in key elements such as numerals, entities, and topics in LLM generations. Experiments show that MisMitiFact generates counter-responses of comparable quality to LLMs' self-feedback while using significantly smaller critique models. Importantly, it achieves ~5x increase in feedback generation throughput, making it highly suitable for cost-effective, large-scale misinformation mitigation. Code and LLM prompt templates are at https://github.com/xxfwin/MisMitiFact.
Chinese: MisMitiFact 是一个高效框架,通过轻量级评论模型生成基于事实的反驳回应来对抗虚假信息,在达到与大型语言模型自反馈相当质量的同时,将反馈生成吞吐量提升约5倍,适用于大规模应用。
English: MisMitiFact is an efficient framework that uses lightweight critique models to generate fact-grounded counter-responses against misinformation, achieving comparable quality to LLM self-feedback while significantly increasing throughput by approximately 5 times for scalable mitigation.
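
One of the fine-grained critiques (numerals) is easy to picture; here is a toy sketch, assuming critiques are plain strings fed back to the generator. The paper trains learned critique models rather than using regexes, so this is only an illustration of the feedback shape.

```python
# Toy numeral critique: flag numbers in a counter-response that never appear
# in the supporting evidence, producing feedback the generator can act on.
import re

def numeral_critique(response: str, evidence: str) -> list[str]:
    nums_resp = set(re.findall(r"\d+(?:\.\d+)?", response))
    nums_evid = set(re.findall(r"\d+(?:\.\d+)?", evidence))
    return [f"Numeral '{n}' is not supported by the evidence."
            for n in sorted(nums_resp - nums_evid)]

evidence = "The CDC reports 34 confirmed cases as of March 2021."
response = "Officials confirmed 340 cases in March 2021."
print(numeral_critique(response, evidence))  # flags '340'
```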

Authors:Taiga Shinozaki, Tomoki Doi, Amane Watahiki, Satoshi Nishida, Hitomi Yanaka
Title: Do Large Vision-Language Models Distinguish between the Actual and Apparent Features of Illusions?
Abstract:
Humans are susceptible to optical illusions, which serve as valuable tools for investigating sensory and cognitive processes. Inspired by human vision studies, research has begun exploring whether machines, such as large vision language models (LVLMs), exhibit similar susceptibilities to visual illusions. However, studies often have used non-abstract images and have not distinguished actual and apparent features, leading to ambiguous assessments of machine cognition. To address these limitations, we introduce a visual question answering (VQA) dataset, categorized into genuine and fake illusions, along with corresponding control images. Genuine illusions present discrepancies between actual and apparent features, whereas fake illusions have the same actual and apparent features even though they look illusory due to the similar geometric configuration. We evaluate the performance of LVLMs for genuine and fake illusion VQA tasks and investigate whether the models discern actual and apparent features. Our findings indicate that although LVLMs may appear to recognize illusions by correctly answering questions about both feature types, they predict the same answers for both Genuine Illusion and Fake Illusion VQA questions. This suggests that their responses might be based on prior knowledge of illusions rather than genuine visual understanding. The dataset is available at https://github.com/ynklab/FILM
Chinese: 本研究通过引入新型数据集评估大型视觉语言模型对视觉错觉的真实感知能力,发现尽管模型能正确回答问题,但其反应可能基于先验知识而非真正的视觉理解。
English: This study introduces a dataset to assess whether large vision language models genuinely perceive visual illusions or rely on prior knowledge, revealing that their responses may not reflect true visual understanding despite correct answers.

Authors:Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Yifan Li, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Title: Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
Abstract:
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.
中文摘要:本文提出了一种知识遗忘评估框架,通过知识图谱和大型语言模型作为评判者,更准确地评估大语言模型中已遗忘事实的隐性存留,发现现有方法高估了遗忘效果。
English Summary: This paper introduces a knowledge unlearning evaluation framework that uses knowledge graphs and LLM judges to more accurately assess the implicit persistence of forgotten facts in large language models, revealing that current methods overestimate unlearning effectiveness.

Authors:Fang Wu, Vijay Prakash Dwivedi, Jure Leskovec
Title: Large Language Models are Good Relational Learners
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across various domains, yet their application to relational deep learning (RDL) remains underexplored. Existing approaches adapt LLMs by traversing relational links between entities in a database and converting the structured data into flat text documents. Still, this text-based serialization disregards critical relational structures, introduces redundancy, and often exceeds standard LLM context lengths. We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for LLMs within a retrieval-augmented generation (RAG) framework. Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to effectively process and reason over complex entity relationships. Specifically, the GNN encoder extracts a local subgraph around an entity to build feature representations that contain relevant entity relationships and temporal dependencies. These representations are transformed into structured prompts using a denormalization process, effectively allowing the LLM to reason over relational structures. Through extensive experiments, we demonstrate that Rel-LLM outperforms existing methods on key RDL tasks, offering a scalable and efficient approach to integrating LLMs with structured data sources. Code is available at https://github.com/smiles724/Rel-LLM.
Chinese: Rel-LLM提出了一种新颖架构,通过基于图神经网络的编码器在检索增强生成框架中生成结构化关系提示,既保留了数据库的内在关系结构,又使大语言模型能有效处理复杂实体关系,在关键关系深度学习任务上超越了现有方法。
English: Rel-LLM introduces a novel architecture that uses a GNN-based encoder to generate structured relational prompts within a RAG framework, preserving database structures and enabling LLMs to effectively reason over complex entity relationships, outperforming existing methods on key RDL tasks.

Authors:Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong, Xiao Huang, Jinsong Su
Title: When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
Abstract:
Graph retrieval-augmented generation (GraphRAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) with external knowledge. It leverages graphs to model the hierarchical structure between specific concepts, enabling more coherent and effective knowledge retrieval for accurate reasoning. Despite its conceptual promise, recent studies report that GraphRAG frequently underperforms vanilla RAG on many real-world tasks. This raises a critical question: Is GraphRAG really effective, and in which scenarios do graph structures provide measurable benefits for RAG systems? To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning. GraphRAG-Bench features a comprehensive dataset with tasks of increasing difficulty, covering fact retrieval, complex reasoning, contextual summarization, and creative generation, and a systematic evaluation across the entire pipeline, from graph construction and knowledge retrieval to final generation. Leveraging this novel benchmark, we systematically investigate the conditions when GraphRAG surpasses traditional RAG and the underlying reasons for its success, offering guidelines for its practical application. All related resources and analyses are collected for the community at https://github.com/GraphRAG-Bench/GraphRAG-Benchmark.
Chinese: GraphRAG-Bench是一个全面基准,通过测试不同难度任务中的层次知识检索和上下文推理,旨在评估GraphRAG何时及为何优于传统RAG。
English: GraphRAG-Bench is a comprehensive benchmark introduced to evaluate when and why GraphRAG outperforms traditional RAG by testing hierarchical knowledge retrieval and contextual reasoning across tasks of varying difficulty.

Authors:Keinichi Fujita, Shota Horiguchi, Yusuke Ijima
Title: Voice Impression Control in Zero-Shot TTS
Abstract:
Para-/non-linguistic information in speech is pivotal in shaping the listeners' impression. Although zero-shot text-to-speech (TTS) has achieved high speaker fidelity, modulating subtle para-/non-linguistic information to control perceived voice characteristics, i.e., impressions, remains challenging. We have therefore developed a voice impression control method in zero-shot TTS that utilizes a low-dimensional vector to represent the intensities of various voice impression pairs (e.g., dark-bright). The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control. Furthermore, generating this vector via a large language model enables target-impression generation from a natural language description of the desired impression, thus eliminating the need for manual optimization. Audio examples are available on our demo page (https://ntt-hilab-gensp.github.io/is2025voiceimpression/).

Authors:Chao Zhang, Li Wang, Samson Lasaulce, Merouane Debbah
Title: BAQ: Efficient Bit Allocation Quantization for Large Language Models
Abstract:
Post-training model quantization is a widely adopted technique for reducing the memory and computational costs of large language models (LLMs). However, most existing methods rely on uniform or heuristic bitwidth assignments, failing to account for the nonuniform sensitivity of weights to quantization noise. In this paper, we propose a novel framework for allocating quantization bitwidths based on sensitivity metrics derived from a Hessian proxy. We make key assumptions, which allow the layer/component-wise loss function to be expressed as an explicit function of the bitwidths. This enables a neat formulation of the bit allocation problem as a convex optimization task, whose closed-form solution adapts precision across weights to minimize the layer-wise quantization loss. Inspecting the solution provides several insights (such as the equal-loss structure), which are then exploited to design the proposed BAQ (Bit Allocation Quantization) algorithm. The proposed algorithm achieves a good trade-off between loss minimization and complexity and allows BAQ to be integrated into standard quantization pipelines with minimal overhead. Experimental results show that BAQ consistently outperforms GPTQ, achieving up to 56× lower perplexity at the same bitwidth on large language models ranging from 125M to 30B parameters. Leveraging our analytical results derived from solving the optimal bit allocation problem, we also provide a theoretical explanation for the observed gains. All codes of this paper are available at https://github.com/CSU-ModelCompression/BAQ.
中文摘要:本文提出BAQ量化框架,通过基于Hessian的敏感度指标和凸优化实现最优位宽分配,在多种规模的大语言模型上相比GPTQ实现高达56倍的困惑度降低。
English Summary: This paper introduces BAQ, a novel quantization framework that optimally allocates bitwidths using Hessian-based sensitivity metrics and convex optimization, significantly outperforming GPTQ with up to 56× lower perplexity across various LLM sizes.
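
The closed-form flavor can be illustrated with the textbook equal-loss allocation: under a loss model sum_i s_i · 2^(-2·b_i) with an average-bitwidth budget, the optimum is a log-sensitivity rule. This is a standard derivation shown for illustration, not necessarily BAQ's exact objective.

```python
# Sketch: closed-form bit allocation b_i = avg + 0.5*log2(s_i / geo_mean(s)),
# which equalizes each group's loss term s_i * 2^(-2*b_i).
import numpy as np

def allocate_bits(sens: np.ndarray, avg_bits: float) -> np.ndarray:
    log_gm = np.mean(np.log2(sens))          # log of the geometric mean
    return avg_bits + 0.5 * (np.log2(sens) - log_gm)

sens = np.array([1.0, 4.0, 16.0, 64.0])      # Hessian-proxy sensitivities
bits = allocate_bits(sens, avg_bits=4.0)
print(bits, bits.mean())                     # [2.5 3.5 4.5 5.5], budget kept at 4.0
```

With these values every term s_i · 2^(-2·b_i) comes out identical (1/32), echoing the equal-loss structure the abstract mentions.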

Authors:Junjie Xing, Yeye He, Mengyu Zhou, Haoyu Dong, Shi Han, Lingjiao Chen, Dongmei Zhang, Surajit Chaudhuri, H. V. Jagadish
Title: MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark
Abstract:
Tables and table-based use cases play a crucial role in many important real-world applications, such as spreadsheets, databases, and computational notebooks, which traditionally require expert-level users like data engineers, data analysts, and database administrators to operate. Although LLMs have shown remarkable progress in working with tables (e.g., in spreadsheet and database copilot scenarios), comprehensive benchmarking of such capabilities remains limited. In contrast to an extensive and growing list of NLP benchmarks, evaluations of table-related tasks are scarce, and narrowly focus on tasks like NL-to-SQL and Table-QA, overlooking the broader spectrum of real-world tasks that professional users face. This gap limits our understanding and model progress in this important area. In this work, we introduce MMTU, a large-scale benchmark with over 30K questions across 25 real-world table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables at the expert level. These tasks are drawn from decades' worth of computer science research on tabular data, with a focus on complex table tasks faced by professional users. We show that MMTU requires a combination of skills -- including table understanding, reasoning, and coding -- that remain challenging for today's frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement. We highlight key findings in our evaluation using MMTU and hope that this benchmark drives further advances in understanding and developing foundation models for structured data processing and analysis. Our code and data are available at https://github.com/MMTU-Benchmark/MMTU and https://huggingface.co/datasets/MMTU-benchmark/MMTU.
中文:MMTU是一个包含超过3万个问题、涵盖25种专家级表格任务的大规模基准,旨在全面评估模型对真实表格的理解、推理和操作能力,结果显示即使顶级模型得分也仅约60%,表明仍有巨大改进空间。
English: MMTU is a large-scale benchmark with over 30,000 questions across 25 expert-level table tasks, designed to comprehensively evaluate models' ability to understand, reason, and manipulate real tables, revealing significant room for improvement as even top models score only around 60%.

Authors:Ludovic Arnould, Salim Khazem, Hugues Ali Mehenni
Title: BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models
Abstract:
Visual Language Models (VLMs) are now sufficiently advanced to support a broad range of applications, including answering complex visual questions, and are increasingly expected to interact with images in varied ways. To evaluate them, current benchmarks often focus on specific domains (e.g., reading charts), constructing datasets of annotated real images paired with pre-defined Multiple Choice Questions (MCQs) to report aggregate accuracy scores. However, such benchmarks entail high annotation costs, risk information leakage, and do not clarify whether failures stem from limitations in visual perception, reasoning, or general knowledge. We propose a new evaluation methodology, inspired by ophthalmologic diagnostics, leveraging procedural generation of synthetic images to obtain control over visual attributes and precisely reveal perception failures in VLMs. Specifically, we build collections of images with gradually more challenging variations in the content of interest (e.g., number of objects in a counting task) while holding other visual parameters constant. This diagnostic allows systematic stress testing and fine-grained failure analysis, shifting the focus from coarse benchmarking toward targeted and interpretable assessment of VLM capabilities. Our code is available at https://github.com/byoeval/BYO-EVAL.
Chinese: 该摘要提出了一种新的视觉语言模型评估方法,通过程序生成合成图像来系统性地测试并精确识别感知缺陷,超越了依赖标注真实图像和总体准确率的传统基准测试。
English: The abstract proposes a new diagnostic evaluation method for Visual Language Models (VLMs) using procedurally generated synthetic images to systematically test and precisely identify perception failures, moving beyond traditional benchmarks that rely on annotated real images and aggregate accuracy scores.
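
The procedural-generation idea can be sketched in a few lines: render counting-task images where only the object count varies while color, size, and background stay fixed, so failures isolate count perception. Rendering details below are arbitrary illustrative choices.

```python
# Sketch: synthetic counting images with one controlled attribute (count).
# A fixed seed keeps layouts comparable across difficulty levels.
from PIL import Image, ImageDraw
import random

def counting_image(n_objects: int, size: int = 256, seed: int = 0) -> Image.Image:
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    for _ in range(n_objects):
        x, y = rng.randint(16, size - 16), rng.randint(16, size - 16)
        draw.ellipse([x - 10, y - 10, x + 10, y + 10], fill="navy")
    return img

# Gradually harder instances: same style, increasing count.
for n in (3, 7, 15):
    counting_image(n).save(f"count_{n}.png")
```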

Authors:Patrik Czakó, Gábor Kertész, Sándor Szénási
Title: SmoothRot: Combining Channel-Wise Scaling and Rotation for Quantization-Friendly LLMs
Abstract:
We present SmoothRot, a novel post-training quantization technique to enhance the efficiency of 4-bit quantization in Large Language Models (LLMs). SmoothRot addresses the critical challenge of massive activation outliers by integrating channel-wise scaling with Hadamard transformations. Our technique effectively transforms extreme outliers into quantization-friendly activations, significantly improving quantization accuracy. Experiments conducted on popular LLMs (LLaMA2 7B, LLaMA3.1 8B, and Mistral 7B) demonstrate that SmoothRot consistently reduces the performance gap between quantized and FP16 models by approximately 10-30% across language generation and zero-shot reasoning tasks, without introducing additional inference latency. Code is available at https://github.com/czakop/smoothrot.
中文: SmoothRot是一种新颖的训练后量化技术,通过将激活异常值转化为量化友好形式,显著提升大型语言模型的4位量化效率,在不增加延迟的情况下将性能差距缩小10-30%。
English: SmoothRot is a novel post-training quantization technique that enhances 4-bit quantization efficiency in Large Language Models by transforming activation outliers into quantization-friendly forms, significantly reducing performance gaps by 10-30% without added latency.
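
A minimal numeric sketch of the scale-then-rotate trick on a single linear layer, assuming SmoothQuant-style heuristic scales (the paper's calibration differs): the layer's function is preserved exactly while activation outliers are flattened before quantization.

```python
# Sketch: y = x W^T = (x S^-1 H)(H^T S W^T), with S channel scales and H an
# orthonormal Hadamard rotation folded into the weights.
import numpy as np
from scipy.linalg import hadamard

d = 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))                # weight of a linear layer
x = rng.normal(size=(4, d)) * np.array([1, 50, 1, 1, 1, 1, 1, 200.0])  # outliers

s = np.abs(x).max(0) ** 0.5                # channel-wise scales (heuristic)
H = hadamard(d) / np.sqrt(d)               # orthonormal Hadamard matrix

x_t = (x / s) @ H                          # transformed, outlier-flattened input
W_t = W @ (np.diag(s) @ H)                 # fold S and H into the weights
assert np.allclose(x @ W.T, x_t @ W_t.T)   # function unchanged pre-quantization
```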

Authors:Zory Zhang, Pinyuan Feng, Bingyang Wang, Tianwei Zhao, Suyang Yu, Qingying Gao, Hokin Deng, Ziqiao Ma, Yijiang Li, Dezhi Luo
Title: Can Vision Language Models Infer Human Gaze Direction? A Controlled Study
Abstract:
Gaze-referential inference--the ability to infer what others are looking at--is a critical component of a theory of mind that underpins natural human-AI interaction. In a controlled study, we evaluated this skill across 111 Vision Language Models (VLMs) using photos taken with manipulated difficulty and variability, comparing performance with that of human participants (N = 65), and analyzed behaviors using mixed-effects models. We found that 94 of the 111 VLMs failed to do better than random guessing, while humans achieved near-ceiling accuracy. VLMs even respond with each choice almost equally frequently. Are they randomly guessing? Although most VLMs struggle, when we zoom in on five of the top-tier VLMs with above-chance performance, we find that their performance declined with increasing task difficulty but varied only slightly across different prompts and scene objects. These behavioral features cannot be explained by considering them as random guessers. Instead, they likely use a combination of heuristics and guessing such that their performance is subject to the task difficulty but robust to perceptual variations. This suggests that VLMs, lacking gaze inference capability, have yet to become technologies that can naturally interact with humans, but the potential remains.

Authors:Shenyang Huang, Ali Parviz, Emma Kondrup, Zachary Yang, Zifeng Ding, Michael Bronstein, Reihaneh Rabbany, Guillaume Rabusseau
Title: Are Large Language Models Good Temporal Graph Learners?
Abstract:
Large Language Models (LLMs) have recently driven significant advancements in Natural Language Processing and various other applications. While a broad range of literature has explored the graph-reasoning capabilities of LLMs, including their use of predictors on graphs, the application of LLMs to dynamic graphs -- real world evolving networks -- remains relatively unexplored. Recent work studies synthetic temporal graphs generated by random graph models, but applying LLMs to real-world temporal graphs remains an open question. To address this gap, we introduce Temporal Graph Talker (TGTalker), a novel temporal graph learning framework designed for LLMs. TGTalker utilizes the recency bias in temporal graphs to extract relevant structural information, converted to natural language for LLMs, while leveraging temporal neighbors as additional information for prediction. TGTalker demonstrates competitive link prediction capabilities compared to existing Temporal Graph Neural Network (TGNN) models. Across five real-world networks, TGTalker performs competitively with state-of-the-art temporal graph methods while consistently outperforming popular models such as TGN and HTGN. Furthermore, TGTalker generates textual explanations for each prediction, thus opening up exciting new directions in explainability and interpretability for temporal link prediction. The code is publicly available at https://github.com/shenyangHuang/TGTalker.
中文: 本文提出的TGTalker创新框架使大语言模型能在真实时序图上实现具有竞争力的链接预测,并通过生成文本解释显著提升了模型的可解释性。
English: This paper introduces TGTalker, a novel framework that enables Large Language Models to perform competitive link prediction on real-world temporal graphs while generating textual explanations for enhanced interpretability.
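
The recency-biased verbalization can be sketched directly; the edge format and prompt wording below are illustrative, not TGTalker's exact templates.

```python
# Sketch: keep only the query node's most recent interactions and render them
# as text for an LLM link-prediction prompt.
def verbalize(edges, query_node, k=5):
    """edges: list of (src, dst, timestamp), assumed time-ordered."""
    recent = [e for e in edges if query_node in (e[0], e[1])][-k:]
    lines = [f"At t={t}, {u} interacted with {v}." for u, v, t in recent]
    return ("\n".join(lines) +
            f"\nWhich node will {query_node} interact with next?")

edges = [("a", "b", 1), ("a", "c", 2), ("b", "c", 3), ("a", "b", 4)]
print(verbalize(edges, "a"))
```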

Authors:Anirudh Bharadwaj, Chaitanya Malaviya, Nitish Joshi, Mark Yatskar
Title: Flattery, Fluff, and Fog: Diagnosing and Mitigating Idiosyncratic Biases in Preference Models
Abstract:
Language models serve as proxies for human preference judgements in alignment and evaluation, yet they exhibit systematic miscalibration, prioritizing superficial patterns over substantive qualities. This bias manifests as overreliance on features like length, structure, and style, leading to issues like reward hacking and unreliable evaluations. Evidence suggests these biases originate in artifacts in human training data. In this work, we systematically investigate the relationship between training data biases and preference model miscalibration across five idiosyncratic features of language model generations: length, structure, jargon, sycophancy and vagueness. Using controlled counterfactual pairs, we first quantify the extent to which preference models favor responses with magnified biases (skew), finding this preference occurs in >60% of instances, and model preferences show high miscalibration (~40%) compared to human preferences. Notably, bias features only show mild negative correlations to human preference labels (mean r_human = -0.12) but show moderately strong positive correlations with labels from a strong reward model (mean r_model = +0.36), suggesting that models may overrely on spurious cues. To mitigate these issues, we propose a simple post-training method based on counterfactual data augmentation (CDA) using synthesized contrastive examples. Finetuning models with CDA reduces average miscalibration from 39.4% to 32.5% and average absolute skew difference from 20.5% to 10.0%, while maintaining overall RewardBench performance, showing that targeted debiasing is effective for building reliable preference models.
中文: 语言模型在偏好评估中表现出系统性偏差,过度依赖长度、结构等表面特征,但通过反事实数据增强的微调方法,可将平均误校准率从39.4%降至32.5%,同时保持整体性能,有效提升模型可靠性。
English: Language models exhibit miscalibration by favoring superficial features like length and style over substantive qualities, but this can be mitigated through counterfactual data augmentation, reducing miscalibration by nearly 7% and skew by over 10% while preserving performance.

Authors:Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Title: Search Arena: Analyzing Search-Augmented LLMs
Abstract:
Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model's parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.
中文: 本研究推出了Search Arena这一大规模人类偏好数据集,用于评估搜索增强语言模型,发现用户偏好受引用数量和来源类型影响,交叉分析表明网络搜索能提升非搜索场景性能,但仅依赖参数知识会显著降低搜索密集型任务的质量。
English: This study introduces Search Arena, a large-scale human-preference dataset for evaluating search-augmented language models, revealing that user preferences are influenced by citation quantity and source type, while cross-analysis shows web search enhances performance in non-search settings but reliance on parametric knowledge alone degrades search-intensive results.

Authors:Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
Title: Kinetics: Rethinking Test-Time Scaling Laws
Abstract:
We rethink test-time scaling laws from a practical efficiency perspective, revealing that the effectiveness of smaller models is significantly overestimated. Prior work, grounded in compute-optimality, overlooks critical memory access bottlenecks introduced by inference-time strategies (e.g., Best-of-N, long CoTs). Our holistic analysis, spanning models from 0.6B to 32B parameters, reveals a new Kinetics Scaling Law that better guides resource allocation by incorporating both computation and memory access costs. The Kinetics Scaling Law suggests that test-time compute is more effective when used on models above a threshold than smaller ones. A key reason is that in TTS, attention, rather than parameter count, emerges as the dominant cost factor. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which lowers per-token cost and enables longer generations and more parallel samples within the same resource budget. Empirically, we show that sparse attention models consistently outperform dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes for problem-solving accuracy on AIME, encompassing evaluations on state-of-the-art MoEs. These results suggest that sparse attention is essential, and increasingly important with more computing invested, for realizing the full potential of test-time scaling where, unlike training, accuracy has yet to saturate as a function of computation and continues to improve through increased generation. The code is available at https://github.com/Infini-AI-Lab/Kinetics.
中文摘要:本研究提出动力学缩放定律,揭示因忽略内存瓶颈而高估了小模型效能,并通过稀疏注意力机制降低单令牌成本、支持生成长文本,从而优化资源分配。
English Summary: This study introduces the Kinetics Scaling Law, which demonstrates that smaller models' effectiveness is overestimated due to overlooked memory bottlenecks, and proposes sparse attention to optimize resource allocation by reducing per-token costs and enabling longer generations.
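
The abstract's core point, that KV-cache reads rather than parameter count can dominate test-time cost, can be illustrated with a toy cost model; all constants below are made up for illustration, not fitted to any hardware or to the paper's law.

```python
# Toy cost model: compute scales with parameters, while KV-cache reads scale
# quadratically with generation length and are independent of model size.
def tts_cost(n_params: float, seq_len: int, n_samples: int,
             flop_per_tok_per_param: float = 2.0, kv_bytes_per_tok: float = 1e5):
    compute = n_samples * seq_len * flop_per_tok_per_param * n_params
    # each generated token re-reads the KV cache of all previous tokens
    memory = n_samples * kv_bytes_per_tok * seq_len * (seq_len + 1) / 2
    return compute, memory

for n in (0.6e9, 32e9):
    c, m = tts_cost(n, seq_len=8192, n_samples=8)
    print(f"{n/1e9:.1f}B params: compute={c:.2e} FLOPs, kv-reads={m:.2e} bytes")
```

The memory term is unchanged across model sizes and quadratic in generation length, which is why long CoTs shift the optimum toward larger models and make sparse attention attractive.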

Authors:Nathan Herr, Tim Rocktäschel, Roberta Raileanu
Title: LLM-First Search: Self-Guided Exploration of the Solution Space
Abstract:
Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While methods like Monte Carlo Tree Search (MCTS) have proven effective in some domains, their reliance on fixed exploration hyperparameters limits their adaptability across tasks of varying difficulty, rendering them impractical or expensive in certain settings. In this paper, we propose LLM-First Search (LFS), a novel LLM Self-Guided Search method that removes the need for pre-defined search strategies by empowering the LLM to autonomously control the search process via self-guided exploration. Rather than relying on external heuristics or hardcoded policies, the LLM evaluates whether to pursue the current search path or explore alternative branches based on its internal scoring mechanisms. This enables more flexible and context-sensitive reasoning without requiring manual tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku against three classic, widely used search algorithms, Tree-of-Thoughts' Breadth First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which has been used to achieve SotA results on a range of challenging reasoning tasks. We found that LFS (1) performs better on more challenging tasks without additional tuning, (2) is more computationally efficient compared to the other methods, especially when powered by a stronger model, (3) scales better with stronger models, due to its LLM-First design, and (4) scales better with increased compute budget. Our code is publicly available at https://github.com/NathanHerr/LLM-First-Search.
Chinese Summary: LLM优先搜索(LFS)是一种新型的自引导搜索方法,它通过大语言模型的内部评分机制自主控制搜索过程,无需预定义策略,并在复杂推理任务中展现出更优的性能、效率和可扩展性。
English Summary: LLM-First Search (LFS) is a novel self-guided search method that enables large language models to autonomously control the search process through internal scoring, eliminating the need for predefined strategies and demonstrating superior performance, efficiency, and scalability on challenging reasoning tasks.
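
A control-flow sketch of the self-guided loop, assuming hypothetical `llm_expand` (proposes next steps), `llm_score` (the model's internal preference), and a `state.is_solution()` interface; the real method's prompting and scoring are more involved.

```python
# Sketch: the LLM itself decides whether to deepen the current path or
# backtrack, by scoring candidate children and keeping alternatives around.
def lfs(root, llm_expand, llm_score, budget=50):
    frontier = [root]                  # candidate states; most promising last
    fallback = root
    for _ in range(budget):
        if not frontier:
            break
        state = frontier.pop()         # pursue the LLM's current favorite
        if state.is_solution():
            return state
        children = llm_expand(state)   # LLM proposes next steps
        # LLM-judged choice: best-scored child lands on top of the frontier,
        # weaker alternatives remain available for later backtracking
        frontier.extend(sorted(children, key=llm_score))
        fallback = state
    return fallback
```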

Authors:Moritz Miller, Bernhard Schölkopf, Siyuan Guo
Title: Counterfactual reasoning: an analysis of in-context emergence
Abstract:
Large-scale neural language models (LMs) exhibit remarkable performance in in-context learning: the ability to learn and reason over the input context on the fly without parameter updates. This work studies in-context counterfactual reasoning in language models, that is, to predict the consequences of changes under hypothetical scenarios. We focus on studying a well-defined synthetic setup: a linear regression task that requires noise abduction, where accurate prediction is based on inferring and copying the contextual noise from factual observations. We show that language models are capable of counterfactual reasoning in this controlled setup and provide insights that counterfactual reasoning for a broad class of functions can be reduced to a transformation on in-context observations; we find self-attention, model depth, and data diversity in pre-training drive performance in Transformers. More interestingly, our findings extend beyond regression tasks and show that Transformers can perform noise abduction on sequential data, providing preliminary evidence on the potential for counterfactual story generation. Our code is available at https://github.com/moXmiller/counterfactual-reasoning.git.
中文: 研究表明大规模语言模型能够通过噪声溯因在线性回归任务中进行上下文反事实推理,其性能受自注意力机制、模型深度和预训练数据多样性影响,并显示出在反事实故事生成等领域的应用潜力。
English: This research demonstrates that large-scale language models can perform in-context counterfactual reasoning through noise abduction in linear regression tasks, with performance driven by self-attention mechanisms, model depth, and pre-training data diversity, showing potential for applications like counterfactual story generation.
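
The noise-abduction step the abstract refers to is easy to make concrete. Below is a small numeric illustration of my own (not the paper's code): in y = w*x + eps, a counterfactual asks what y would have been under a different input x' while keeping the same realized noise.

```python
# Minimal numeric illustration of counterfactual prediction via noise
# abduction in a linear model: abduct the latent noise from the factual
# observation, then reuse it under the intervened input.
import numpy as np

rng = np.random.default_rng(0)
w = 2.0
x_fact = 3.0
eps = rng.normal(scale=0.5)          # latent noise, normally unobserved
y_fact = w * x_fact + eps            # factual observation

# Abduction: infer the noise from the factual pair (requires knowing w).
eps_hat = y_fact - w * x_fact        # recovers eps exactly in this linear case

# Action + prediction: intervene x -> x', reuse the abducted noise.
x_cf = 5.0
y_cf = w * x_cf + eps_hat

print(f"eps={eps:.3f}, eps_hat={eps_hat:.3f}, counterfactual y={y_cf:.3f}")
# In-context, a Transformer must implicitly do the same: estimate w from
# context examples, abduct eps from the factual pair, and apply it to x'.
```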

Authors:Yeonseok Jeong, Jinsu Kim, Dohyeon Lee, Seung-won Hwang
Title: ECoRAG: Evidentiality-guided Compression for Long Context RAG
Abstract:
Large Language Models (LLMs) have shown remarkable performance in Open-Domain Question Answering (ODQA) by leveraging external documents through Retrieval-Augmented Generation (RAG). To reduce the overhead that RAG incurs from longer contexts, context compression is necessary. However, prior compression methods do not focus on filtering out non-evidential information, which limits performance in LLM-based RAG. We thus propose Evidentiality-guided RAG, or the ECoRAG framework. ECoRAG improves LLM performance by compressing retrieved documents based on evidentiality, ensuring that answer generation is supported by the correct evidence. As an additional step, ECoRAG reflects on whether the compressed content provides sufficient evidence, and if not, retrieves more until it is sufficient. Experiments show that ECoRAG improves LLM performance on ODQA tasks, outperforming existing compression methods. Furthermore, ECoRAG is highly cost-efficient, as it not only reduces latency but also minimizes token usage by retaining only the information necessary to generate the correct answer. Code is available at https://github.com/ldilab/ECoRAG.
Chinese: ECoRAG框架通过基于证据性压缩检索文档并在证据不足时迭代补充内容,显著提升大语言模型在开放域问答中的性能,相比现有方法在保证准确性的同时实现了更高的成本效益。
English: The ECoRAG framework enhances LLM performance in Open-Domain Question Answering by compressing retrieved documents based on evidentiality and iteratively retrieving additional content if evidence is insufficient, achieving both higher accuracy and cost efficiency compared to existing methods.
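
The abstract outlines a two-step loop: filter retrieved text by evidentiality, then reflect on sufficiency and widen retrieval if needed. The following is a hedged sketch of that control flow; `retrieve`, `evidentiality_score`, and `is_sufficient` are stand-ins (ECoRAG trains a scorer and uses an LLM for the reflection step).

```python
# Hedged sketch of an evidentiality-guided compression loop: keep only
# sentences scored as evidential, and retrieve more if what is kept looks
# insufficient. All three helpers are illustrative stand-ins.

def retrieve(query: str, k: int) -> list[str]:
    corpus = [
        "Paris is the capital of France.",
        "The Eiffel Tower is in Paris.",
        "Bananas are rich in potassium.",
    ]
    return corpus[:k]

def evidentiality_score(query: str, sentence: str) -> float:
    # Stand-in: ECoRAG trains a scorer; here we use simple word overlap.
    q, s = set(query.lower().split()), set(sentence.lower().split())
    return len(q & s) / max(len(q), 1)

def is_sufficient(query: str, evidence: list[str]) -> bool:
    # Stand-in for the LLM's sufficiency-reflection step.
    return len(evidence) >= 1

def ecorag_compress(query: str, k: int = 2, max_k: int = 6,
                    threshold: float = 0.2) -> list[str]:
    evidence: list[str] = []
    while k <= max_k:
        docs = retrieve(query, k)
        evidence = [d for d in docs
                    if evidentiality_score(query, d) >= threshold]
        if is_sufficient(query, evidence):
            return evidence                 # compressed, evidential context
        k += 2                              # not enough evidence: widen retrieval
    return evidence

print(ecorag_compress("capital of France"))
```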

Authors:Chenyu Lin, Yilin Wen, Du Su, Fei Sun, Muhan Chen, Chenfu Bao, Zhonghou Lv
Title: Knowledgeable-r1: Policy Optimization for Knowledge Exploration in Retrieval-Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) is a mainstream method for improving performance on knowledge-intensive tasks. However, current RAG systems often place too much emphasis on retrieved contexts. This can lead to reliance on inaccurate sources and overlook the model's inherent knowledge, especially when dealing with misleading or excessive information. To resolve this imbalance, we propose Knowledgeable-r1, which uses joint sampling and defines multiple policy distributions for knowledge-capability exploration to stimulate large language models' self-integrated utilization of parametric and contextual knowledge. Experiments show that Knowledgeable-r1 significantly enhances robustness and reasoning accuracy in both parametric-contextual conflict tasks and general RAG tasks, especially outperforming baselines by 17.07% in counterfactual scenarios and demonstrating consistent gains across RAG tasks. Our code is available at https://github.com/lcy80366872/knowledgeable-r1.
中文摘要:Knowledgeable-R1是一个强化学习框架,通过训练大语言模型利用参数化知识来抵抗误导性检索信息,在知识冲突场景中显著提升了鲁棒性和推理准确性。
English Summary: Knowledgeable-R1 is a reinforcement learning framework that trains large language models to resist misleading retrieved information by leveraging their parametric knowledge, significantly improving robustness and accuracy in knowledge conflict scenarios.

Authors:Noy Sternlicht, Ariel Gera, Roy Bar-Haim, Tom Hope, Noam Slonim
Title: Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
Abstract:
We introduce Debate Speech Evaluation as a novel and challenging benchmark for assessing LLM judges. Evaluating debate speeches requires a deep understanding of the speech at multiple levels, including argument strength and relevance, the coherence and organization of the speech, the appropriateness of its style and tone, and so on. This task involves a unique set of cognitive abilities that previously received limited attention in systematic LLM benchmarking. To explore such skills, we leverage a dataset of over 600 meticulously annotated debate speeches and present the first in-depth analysis of how state-of-the-art LLMs compare to human judges on this task. Our findings reveal a nuanced picture: while larger models can approximate individual human judgments in some respects, they differ substantially in their overall judgment behavior. We also investigate the ability of frontier LLMs to generate persuasive, opinionated speeches, showing that models may perform at a human level on this task.
Chinese: 我们提出辩论演讲评估作为测试大语言模型评判能力的新基准,发现先进模型虽能在某些方面接近人类判断水平,但其整体评估行为与人类存在显著差异。
English: We propose Debate Speech Evaluation as a new benchmark to test LLM judges' ability to assess complex aspects like argument strength and coherence, revealing that while advanced models can match human judgments in certain areas, their overall evaluation behavior differs significantly.

Authors:Zhenran Xu, Xue Yang, Yiyu Wang, Qingli Hu, Zijiao Wu, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: ComfyUI-Copilot: An Intelligent Assistant for Automated Workflow Development
Abstract:
We introduce ComfyUI-Copilot, a large language model-powered plugin designed to enhance the usability and efficiency of ComfyUI, an open-source platform for AI-driven art creation. Despite its flexibility and user-friendly interface, ComfyUI can present challenges to newcomers, including limited documentation, model misconfigurations, and the complexity of workflow design. ComfyUI-Copilot addresses these challenges by offering intelligent node and model recommendations, along with automated one-click workflow construction. At its core, the system employs a hierarchical multi-agent framework comprising a central assistant agent for task delegation and specialized worker agents for different usages, supported by our curated ComfyUI knowledge bases to streamline debugging and deployment. We validate the effectiveness of ComfyUI-Copilot through both offline quantitative evaluations and online user feedback, showing that it accurately recommends nodes and accelerates workflow development. Additionally, use cases illustrate that ComfyUI-Copilot lowers entry barriers for beginners and enhances workflow efficiency for experienced users. The ComfyUI-Copilot installation package and a demo video are available at https://github.com/AIDC-AI/ComfyUI-Copilot.
中文: ComfyUI-Copilot是一款基于大语言模型的插件,通过智能节点推荐和自动化工作流构建提升ComfyUI的易用性,经实践验证能有效降低新手门槛并提高工作效率。
English: ComfyUI-Copilot is an AI-powered plugin that enhances ComfyUI's usability by providing intelligent recommendations and automated workflow construction, validated through evaluations to lower entry barriers and boost efficiency.

Authors:Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen
Title: ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests
Abstract:
With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real competition environments. Moreover, current evaluation metrics such as Pass@K fail to capture the reflective abilities of reasoning models. To address these challenges, we propose \textbf{ICPC-Eval}, a top-level competitive coding benchmark designed to probe the frontiers of LLM reasoning. ICPC-Eval includes 118 carefully curated problems from 11 recent ICPC contests held in various regions of the world, offering three key contributions: 1) A challenging, realistic ICPC competition scenario, featuring a problem type and difficulty distribution consistent with actual contests. 2) A robust test case generation method and a corresponding local evaluation toolkit, enabling efficient and accurate local evaluation. 3) An effective test-time scaling evaluation metric, Refine@K, which allows iterative repair of solutions based on execution feedback. The results underscore the significant challenge of evaluating complex reasoning abilities: top-tier reasoning models like DeepSeek-R1 often rely on multi-turn code feedback to fully unlock their in-context reasoning potential, compared to non-reasoning counterparts. Furthermore, despite recent advancements in code generation, these models still lag behind top-performing human teams. We release the benchmark at: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs
中文:现有基准和指标难以在竞赛环境中充分评估大型语言模型的编程能力,因此我们提出了ICPC-Eval这一顶级编程竞赛基准,通过真实题目和新颖的Refine@K指标揭示模型需依赖迭代反馈才能接近人类水平的表现。
English: Existing benchmarks and metrics inadequately assess LLMs' coding skills in competitive settings, prompting the introduction of ICPC-Eval, a challenging benchmark with realistic problems and a novel Refine@K metric that reveals models' reliance on iterative feedback to approach human-level performance.
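
Refine@K, as the abstract describes it, grants the model up to K attempts, each conditioned on execution feedback from the previous one. A hedged sketch of that loop follows; `generate_solution` and `run_tests` are stand-ins, and the official toolkit's interfaces may differ.

```python
# Hedged sketch of a Refine@K-style evaluation loop: iterative repair of a
# solution driven by execution feedback, up to K attempts.

def generate_solution(problem: str, feedback: str | None) -> str:
    # Stand-in for an LLM call; a real system would include the feedback
    # from the failed attempt in the prompt.
    return "print(sum(map(int, input().split())))"

def run_tests(code: str) -> tuple[bool, str]:
    # Stand-in for the local judge: returns (passed, execution feedback).
    return True, "all tests passed"

def refine_at_k(problem: str, k: int) -> bool:
    feedback = None
    for attempt in range(1, k + 1):
        code = generate_solution(problem, feedback)
        passed, feedback = run_tests(code)
        if passed:
            print(f"solved on attempt {attempt}")
            return True
    return False

refine_at_k("A+B problem", k=3)
```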

Authors:Changyue Wang, Weihang Su, Qingyao Ai, Yiqun Liu
Title: Joint Evaluation of Answer and Reasoning Consistency for Hallucination Detection in Large Reasoning Models
Abstract:
Large Reasoning Models (LRMs) extend large language models with explicit, multi-step reasoning traces to enhance transparency and performance on complex tasks. However, these reasoning traces can be redundant or logically inconsistent, making them a new source of hallucination that is difficult to detect. Existing hallucination detection methods focus primarily on answer-level uncertainty and often fail to detect hallucinations or logical inconsistencies arising from the model's reasoning trace. This oversight is particularly problematic for LRMs, where the explicit thinking trace is not only an important support to the model's decision-making process but also a key source of potential hallucination. To this end, we propose RACE (Reasoning and Answer Consistency Evaluation), a novel framework specifically tailored for hallucination detection in LRMs. RACE operates by extracting essential reasoning steps and computing four diagnostic signals: inter-sample consistency of reasoning traces, entropy-based answer uncertainty, semantic alignment between reasoning and answers, and internal coherence of reasoning. This joint analysis enables fine-grained hallucination detection even when the final answer appears correct. Experiments across datasets and different LLMs demonstrate that RACE outperforms existing hallucination detection baselines, offering a robust and generalizable solution for evaluating LRMs. Our code is available at: https://github.com/bebr2/RACE.
中文摘要:RACE框架通过评估推理轨迹的一致性和答案的语义对齐,专门用于检测大型推理模型中的幻觉问题,即使在最终答案正确的情况下也能有效识别逻辑不一致性,优于现有检测方法。
English Summary: The RACE framework is introduced to detect hallucinations in Large Reasoning Models by evaluating reasoning trace consistency and answer alignment, outperforming existing methods in identifying logical inconsistencies even when final answers are correct.
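
Of RACE's four diagnostic signals, the entropy-based answer uncertainty is simple enough to illustrate directly. The snippet below is my own self-contained illustration, not the paper's code: sample the model several times and measure the entropy of the answer distribution, where higher entropy flags inconsistency.

```python
# Entropy over repeatedly sampled answers as a hallucination warning signal:
# a consistent model concentrates its answers (entropy near 0), while an
# uncertain one spreads them out.
import math
from collections import Counter

def answer_entropy(sampled_answers: list[str]) -> float:
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(answer_entropy(["42", "42", "42", "42"]))   # 0.0  -> consistent
print(answer_entropy(["42", "41", "43", "42"]))   # 1.5  -> uncertain
```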

Authors:Dingdong Wang, Jincenzi Wu, Junan Li, Dongchao Yang, Xueyuan Chen, Tianhua Zhang, Helen Meng
Title: MMSU: A Massive Multi-task Spoken Language Understanding and Reasoning Benchmark
Abstract:
Speech inherently contains rich acoustic information that extends far beyond the textual language. In real-world spoken language understanding, effective interpretation often requires integrating semantic meaning (e.g., content), paralinguistic features (e.g., emotions, speed, pitch) and phonological characteristics (e.g., prosody, intonation, rhythm), which are embedded in speech. While recent multimodal Speech Large Language Models (SpeechLLMs) have demonstrated remarkable capabilities in processing audio information, their ability to perform fine-grained perception and complex reasoning in natural speech remains largely unexplored. To address this gap, we introduce MMSU, a comprehensive benchmark designed specifically for understanding and reasoning in spoken language. MMSU comprises 5,000 meticulously curated audio-question-answer triplets across 47 distinct tasks. To ground our benchmark in linguistic theory, we systematically incorporate a wide range of linguistic phenomena, including phonetics, prosody, rhetoric, syntactics, semantics, and paralinguistics. Through a rigorous evaluation of 14 advanced SpeechLLMs, we identify substantial room for improvement in existing models, highlighting meaningful directions for future optimization. MMSU establishes a new standard for comprehensive assessment of spoken language understanding, providing valuable insights for developing more sophisticated human-AI speech interaction systems. MMSU benchmark is available at https://huggingface.co/datasets/ddwang2000/MMSU. Evaluation Code is available at https://github.com/dingdongwang/MMSU_Bench.
中文: 语音蕴含超越文本的丰富声学信息,为此我们推出了MMSU基准,旨在全面评估并推动多模态语音大语言模型在包含多种语言特征的语音理解与推理能力上的发展。
English: Speech conveys rich acoustic information beyond text, and the MMSU benchmark is introduced to evaluate and advance multimodal SpeechLLMs' understanding and reasoning in spoken language across diverse linguistic features.

Authors:Gio Paik, Geewook Kim, Jinbae Im
Title: MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models
Abstract:
This paper introduces MMRefine, a MultiModal Refinement benchmark designed to evaluate the error refinement capabilities of Multimodal Large Language Models (MLLMs). As the emphasis shifts toward enhancing reasoning during inference, MMRefine provides a framework that evaluates MLLMs' abilities to detect and correct errors across six distinct scenarios beyond just comparing final accuracy before and after refinement. Furthermore, the benchmark analyzes the refinement performance by categorizing errors into six error types. Experiments with various open and closed MLLMs reveal bottlenecks and factors impeding refinement performance, highlighting areas for improvement in effective reasoning enhancement. Our code and dataset are publicly available at https://github.com/naver-ai/MMRefine.
中文: 本文介绍MMRefine基准,用于评估多模态大语言模型在六种场景和错误类型中的检测与修正能力,揭示了当前性能瓶颈和改进方向。
English: This paper presents MMRefine, a benchmark for evaluating multimodal large language models' ability to detect and correct errors across six scenarios and error types, identifying current performance bottlenecks and improvement areas.

Authors:Juhyun Oh, Eunsu Kim, Alice Oh
Title: Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents
Abstract:
Real-world planning problems require constant adaptation to changing requirements and balancing of competing constraints. However, current benchmarks for evaluating LLMs' planning capabilities primarily focus on static, single-turn scenarios. We introduce Flex-TravelPlanner, a benchmark that evaluates language models' ability to reason flexibly in dynamic planning scenarios. Building on the TravelPlanner dataset~\citep{xie2024travelplanner}, we introduce two novel evaluation settings: (1) sequential constraint introduction across multiple turns, and (2) scenarios with explicitly prioritized competing constraints. Our analysis of GPT-4o and Llama 3.1 70B reveals several key findings: models' performance on single-turn tasks poorly predicts their ability to adapt plans across multiple turns; constraint introduction order significantly affects performance; and models struggle with constraint prioritization, often incorrectly favoring newly introduced lower priority preferences over existing higher-priority constraints. These findings highlight the importance of evaluating LLMs in more realistic, dynamic planning scenarios and suggest specific directions for improving model performance on complex planning tasks. The code and dataset for our framework are publicly available at https://github.com/juhyunohh/FlexTravelBench.
Chinese: Flex-TravelPlanner 是一个评估语言模型动态规划能力的新基准,通过多轮交互和竞争性约束测试,揭示了模型在计划调整和约束优先级处理方面的显著不足。
English: Flex-TravelPlanner is a new benchmark that evaluates language models' dynamic planning abilities through multi-turn scenarios and competing constraints, revealing their limitations in adapting plans and prioritizing constraints effectively.

Authors:Qingchuan Li, Jiatong Li, Zirui Liu, Mingyue Cheng, Yuting Zeng, Qi Liu, Tongxuan Liu
Title: Are LLMs Reliable Translators of Logical Reasoning Across Lexically Diversified Contexts?
Abstract:
Neuro-symbolic approaches that combine large language models (LLMs) with solvers excel in logical reasoning problems that require long reasoning chains. In this paradigm, LLMs serve as translators, converting natural language reasoning problems into formal logic formulas; reliable symbolic solvers then return correct solutions. Despite their success, we find that LLMs, as translators, struggle to handle lexical diversification, a common linguistic phenomenon, indicating that LLMs as logic translators are unreliable in real-world scenarios. Moreover, existing logical reasoning benchmarks lack lexical diversity, failing to challenge LLMs' ability to translate such text and thus obscuring this issue. In this work, we propose SCALe, a benchmark designed to address this significant gap through **logic-invariant lexical diversification**. By using LLMs to transform original benchmark datasets into lexically diversified but logically equivalent versions, we evaluate LLMs' ability to consistently map diverse expressions to uniform logical symbols on these new datasets. Experiments using SCALe further confirm that current LLMs exhibit deficiencies in this capability. Building directly on the deficiencies identified through our benchmark, we propose a new method, MenTaL, to address this limitation. This method guides LLMs to first construct a table unifying diverse expressions before performing translation. Applying MenTaL through in-context learning and supervised fine-tuning (SFT) significantly improves the performance of LLM translators on lexically diversified text. Our code is now available at https://github.com/wufeiwuwoshihua/LexicalDiver.
中文: 神经符号方法中,大型语言模型作为逻辑翻译器难以处理词汇多样化问题,导致实际应用不可靠,因此我们提出了SCALe基准和MenTaL方法,以提升模型将多样化表达映射到统一逻辑符号的一致性能力。
English: Neuro-symbolic approaches using LLMs as logic translators struggle with lexical diversification, leading to unreliable performance in real-world scenarios, prompting the creation of the SCALe benchmark and MenTaL method to enhance LLMs' consistency in mapping varied expressions to logical symbols.
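
MenTaL's table-before-translation idea can be sketched as a two-step prompting pipeline. The prompt wording below is my own illustration, not the paper's template, and the toy stand-in LLM exists only so the sketch runs end to end.

```python
# Hedged sketch of the MenTaL idea: first ask the model to build a table
# mapping the text's varied surface expressions onto one canonical predicate
# each, then translate using ONLY those canonical predicates.
UNIFY_PROMPT = """Read the passage and list every distinct way each concept
is expressed, then assign one canonical predicate per concept.

Passage: {passage}

Table (expression -> canonical predicate):
"""

TRANSLATE_PROMPT = """Using ONLY the canonical predicates in the table below,
translate each sentence of the passage into first-order logic.

Table:
{table}

Passage: {passage}

Formulas:
"""

def mental_translate(passage: str, llm) -> str:
    table = llm(UNIFY_PROMPT.format(passage=passage))                  # step 1: unify
    return llm(TRANSLATE_PROMPT.format(table=table, passage=passage))  # step 2: translate

# Toy stand-in LLM so the sketch runs end to end:
fake = lambda prompt: ("is_mortal(x): 'mortal', 'will die', 'perishes'"
                       if "Table (" in prompt
                       else "forall x. human(x) -> is_mortal(x)")
print(mental_translate("All humans are mortal. Every person will die.", fake))
```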

Authors:K. O. T. Erziev
Title: BSBench: will your LLM find the largest prime number?
Abstract:
We propose that benchmarking LLMs on questions which have no reasonable answer actually isn't as silly as it sounds. We also present a benchmark that allows such testing, along with a method for modifying existing datasets, and discover that existing models perform far from perfectly on such questions. Our code and data artifacts are available at https://github.com/L3G5/impossible-bench
中文摘要:对无法回答的问题进行大语言模型基准测试并非无意义,研究发现现有模型在此类问题上表现远未完善。
English Summary: Benchmarking LLMs on unanswerable questions proves insightful, revealing significant performance gaps despite seeming counterintuitive.

Authors:Apurv Verma, NhatHai Phan, Shubhendu Trivedi
Title: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Abstract:
Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
中文摘要:大型语言模型的水印技术会削弱其真实性、安全性和实用性,但提出的对齐重采样方法能在保持水印可检测性的同时有效恢复这些对齐特性。
English Summary: Watermarking techniques in large language models can degrade their truthfulness, safety, and helpfulness, but the proposed Alignment Resampling method effectively restores these alignment properties while maintaining watermark detectability.
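
Alignment Resampling, as presented in the abstract, is best-of-n sampling over watermarked generations scored by an external reward model. A hedged sketch follows; `generate_watermarked` and `reward_model` are stand-ins for a watermarked decoder and a learned RM.

```python
# Hedged sketch of Alignment Resampling: draw a few watermarked generations
# and return the one the external reward model scores highest.
import random

def generate_watermarked(prompt: str) -> str:
    # Stand-in for watermarked decoding (Gumbel or KGW in the paper).
    return prompt + " -> " + random.choice(["reply A", "reply B", "reply C"])

def reward_model(prompt: str, response: str) -> float:
    # Stand-in for a learned reward model score.
    return float(len(response))

def alignment_resampling(prompt: str, n: int = 4) -> str:
    candidates = [generate_watermarked(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: reward_model(prompt, r))

print(alignment_resampling("Explain photosynthesis briefly.", n=4))
# The paper reports that n in the 2-4 range already recovers baseline
# alignment scores while every candidate remains watermark-detectable.
```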

Authors:Qingfei Zhao, Ruobing Wang, Dingling Xu, Daren Zha, Limin Liu
Title: R-Search: Empowering LLM Reasoning with Search via Multi-Reward Reinforcement Learning
Abstract:
Large language models (LLMs) have notably progressed in multi-step and long-chain reasoning. However, extending their reasoning capabilities to encompass deep interactions with search remains a non-trivial challenge, as models often fail to identify optimal reasoning-search interaction trajectories, resulting in suboptimal responses. We propose R-Search, a novel reinforcement learning framework for Reasoning-Search integration, designed to enable LLMs to autonomously execute multi-step reasoning with deep search interaction, and learn optimal reasoning-search interaction trajectories via multi-reward signals, improving response quality in complex logic- and knowledge-intensive tasks. R-Search guides the LLM to dynamically decide when to retrieve or reason, while globally integrating key evidence to enhance deep knowledge interaction between reasoning and search. During RL training, R-Search provides multi-stage, multi-type rewards to jointly optimize the reasoning-search trajectory. Experiments on seven datasets show that R-Search outperforms advanced RAG baselines by up to 32.2% (in-domain) and 25.1% (out-of-domain). The code and data are available at https://github.com/QingFei1/R-Search.
中文: R-Search是一种强化学习框架,通过多奖励信号优化让大语言模型自主执行多步推理与深度搜索交互,在复杂任务中相比现有方法性能提升最高达32.2%。
English: R-Search is a reinforcement learning framework that enables large language models to autonomously integrate multi-step reasoning with deep search interactions through multi-reward optimization, significantly improving performance on complex tasks by up to 32.2% compared to existing methods.

Authors:Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Zhiwei Fei, Hui Su, Xiaoyu Shen
Title: SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling
Abstract:
Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies, but conventional static pruning methods overlook two critical dynamics inherent to LLM inference: (1) horizontal dynamics, where token-level heterogeneity demands context-aware pruning decisions, and (2) vertical dynamics, where the distinct functional roles of MLP and self-attention layers necessitate component-specific pruning policies. We introduce SkipGPT, a dynamic layer pruning framework designed to optimize computational resource allocation through two core innovations: (1) global token-aware routing to prioritize critical tokens, and (2) decoupled pruning policies for MLP and self-attention components. To mitigate training instability, we propose a two-stage optimization paradigm: first, a disentangled training phase that learns routing strategies via soft parameterization to avoid premature pruning decisions, followed by parameter-efficient LoRA fine-tuning to restore performance impacted by layer removal. Extensive experiments demonstrate that SkipGPT reduces over 40% of model parameters while matching or exceeding the performance of the original dense model across benchmarks. By harmonizing dynamic efficiency with preserved expressivity, SkipGPT advances the practical deployment of scalable, resource-aware LLMs. Our code is publicly available at: https://github.com/EIT-NLP/SkipGPT.
Chinese: SkipGPT提出了一种动态层剪枝框架,通过全局令牌感知路由和解耦剪枝策略,在减少超过40%参数的同时保持或超越原始模型的性能表现。
English: SkipGPT introduces a dynamic layer pruning framework that reduces computational costs by over 40% while maintaining or improving performance through token-aware routing and component-specific policies.

Authors:Kejian Zhu, Shangqing Tu, Zhuoran Jin, Lei Hou, Juanzi Li, Jun Zhao
Title: Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
Abstract:
The development of large language models (LLMs) depends on trustworthy evaluation. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through comparative and causal analysis. Building on this, we introduce an evaluation method called shortcut neuron patching to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong linear correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient ($\rho$) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. Code: https://github.com/GaryStack/Trustworthy-Evaluation
中文摘要:本研究通过比较和因果分析识别捷径神经元,提出了一种捷径神经元修补方法,有效缓解大语言模型评估中的数据污染问题,并与可信基准显示出高度相关性。
English Summary: This study addresses data contamination in large language model evaluations by identifying shortcut neurons through comparative and causal analysis, proposing a shortcut neuron patching method that effectively mitigates contamination and demonstrates strong correlation with trustworthy benchmarks.
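
The patching step can be pictured as suppressing a known set of neuron activations at evaluation time. Below is a hedged PyTorch sketch of that mechanic on a toy model; the neuron indices are hypothetical, and the paper's identification procedure and patching values are more involved than zeroing.

```python
# Hedged sketch of shortcut-neuron patching: suppress previously identified
# neurons (here: zero their activations) during evaluation via a forward hook.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
shortcut_neurons = {3, 7, 11}        # hypothetical indices in layer 0's output

def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[..., list(shortcut_neurons)] = 0.0   # suppress shortcut activations
    return patched                                # returned value replaces output

handle = model[0].register_forward_hook(patch_hook)
x = torch.randn(4, 8)
logits_patched = model(x)            # evaluation with shortcuts suppressed
handle.remove()
logits_original = model(x)
print((logits_patched - logits_original).abs().max())
```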

Authors:Disha Sheshanarayana, Tanishka Magar, Ayushi Mittal, Neelam Chaplot
Title: CLAIM: An Intent-Driven Multi-Agent Framework for Analyzing Manipulation in Courtroom Dialogues
Abstract:
Courtrooms are places where lives are determined and fates are sealed, yet they are not impervious to manipulation. Strategic use of manipulation in legal jargon can sway the opinions of judges and affect the decisions. Despite the growing advancements in NLP, its application in detecting and analyzing manipulation within the legal domain remains largely unexplored. Our work addresses this gap by introducing LegalCon, a dataset of 1,063 annotated courtroom conversations labeled for manipulation detection, identification of primary manipulators, and classification of manipulative techniques, with a focus on long conversations. Furthermore, we propose CLAIM, a two-stage, Intent-driven Multi-agent framework designed to enhance manipulation analysis by enabling context-aware and informed decision-making. Our results highlight the potential of incorporating agentic frameworks to improve fairness and transparency in judicial processes. We hope that this contributes to the broader application of NLP in legal discourse analysis and the development of robust tools to support fairness in legal decision-making. Our code and data are available at https://github.com/Disha1001/CLAIM.
中文摘要:本研究提出了LegalCon数据集用于检测法庭对话中的操纵行为,并开发了CLAIM多智能体框架,通过先进自然语言处理技术提升司法过程的公平性与透明度。
English Summary: This research introduces LegalCon, a dataset for detecting manipulation in courtroom conversations, and proposes CLAIM, a multi-agent framework to enhance legal fairness and transparency through advanced NLP analysis.

Authors:Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin
Title: TextAtari: 100K Frames Game Playing with Language Agents
Abstract:
We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios (Basic, Obscured, Manual Augmentation, and Reference-based) investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning. Our code is available at https://github.com/Lww007/Text-Atari-Agents.
中文: TextAtari将经典Atari游戏转化为文本环境,构建了评估语言智能体在长跨度决策任务中表现的基准,揭示了AI模型与人类在序列推理和战略规划方面存在的显著差距。
English: TextAtari is a benchmark that converts Atari games into text-based environments to evaluate language agents on long-horizon decision-making tasks, revealing significant performance gaps between AI models and humans in sequential reasoning and strategic planning.

Authors:Anastasiia Ivanova, Eva Bakaeva, Zoya Volovikova, Alexey K. Kovalev, Aleksandr I. Panov
Title: AmbiK: Dataset of Ambiguous Tasks in Kitchen Environment
Abstract:
As a part of an embodied agent, Large Language Models (LLMs) are typically used for behavior planning given natural language instructions from the user. However, dealing with ambiguous instructions in real-world environments remains a challenge for LLMs. Various methods for task ambiguity detection have been proposed. However, it is difficult to compare them because they are tested on different datasets and there is no universal benchmark. For this reason, we propose AmbiK (Ambiguous Tasks in Kitchen Environment), a fully textual dataset of ambiguous instructions addressed to a robot in a kitchen environment. AmbiK was collected with the assistance of LLMs and is human-validated. It comprises 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type (Human Preferences, Common Sense Knowledge, Safety), with environment descriptions, clarifying questions and answers, user intents, and task plans, for a total of 2000 tasks. We hope that AmbiK will enable researchers to perform a unified comparison of ambiguity detection methods. AmbiK is available at https://github.com/cog-model/AmbiK-dataset.
中文摘要:AmbiK数据集被提出作为一个通用基准,旨在解决在具身智能体中比较大型语言模型模糊指令检测方法的难题,该数据集包含1000对经过人工验证的厨房环境模糊与非模糊任务。
English Summary: The AmbiK dataset is introduced as a universal benchmark to address the challenge of comparing ambiguity detection methods for LLMs in embodied agents, featuring 1000 pairs of ambiguous and unambiguous kitchen tasks with human-validated annotations.

Authors:Ming Zhang, Yujiong Shen, Zelin Li, Huayu Sha, Binze Hu, Yuhui Wang, Chenhao Huang, Shichun Liu, Jingqi Tong, Changhao Jiang, Mingxu Chai, Zhiheng Xi, Shihan Dou, Tao Gui, Qi Zhang, Xuanjing Huang
Title: LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation
Abstract:
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error. Current medical benchmarks have three main types: medical exam-based, comprehensive medical, and specialized assessments. However, these benchmarks have limitations in question design (mostly multiple-choice), data sources (often not derived from real clinical scenarios), and evaluation methods (poor assessment of complex reasoning). To address these issues, we present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios. We also design an automated evaluation pipeline, incorporating expert-developed checklists into our LLM-as-Judge framework. Furthermore, our methodology validates machine scoring through human-machine agreement analysis, dynamically refining checklists and prompts based on expert feedback to ensure reliability. We evaluate 13 LLMs across three categories (specialized medical models, open-source models, and closed-source models) on LLMEval-Med, providing valuable insights for the safe and effective deployment of LLMs in medical domains. The dataset is released in https://github.com/llmeval/LLMEval-Med.
中文: 本文提出LLMEval-Med基准,基于真实电子病历构建了覆盖五大医疗领域的评估体系,通过自动化专家核对框架对13种大语言模型进行测试,为医疗领域安全应用提供关键洞见。
English: This paper introduces LLMEval-Med, a comprehensive medical benchmark developed from real clinical data to address limitations in existing evaluations by incorporating automated scoring with expert-validated checklists, assessing 13 LLMs across five medical domains for safer deployment.

Authors:Yi Zhao, Siqi Wang, Jing Li
Title: LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward
Abstract:
Navigation instruction generation for visually impaired (VI) individuals (NIG-VI) is critical yet relatively underexplored. Hence, this study focuses on producing precise, in-situ, step-by-step navigation instructions that are practically usable by VI users. Concretely, we propose LaF-GRPO (LLM-as-Follower GRPO), where an LLM simulates VI user responses to generate rewards guiding the Vision-Language Model (VLM) post-training. This enhances instruction usability while reducing costly real-world data needs. To facilitate training and testing, we introduce NIG4VI, a 27k-sample open-sourced benchmark. It provides diverse navigation scenarios with accurate spatial coordinates, supporting detailed, open-ended in-situ instruction generation. Experiments on NIG4VI show the effectiveness of LaF-GRPO by quantitative metrics (e.g., Zero-(LaF-GRPO) boosts BLEU +14\%; SFT+(LaF-GRPO) METEOR 0.542 vs. GPT-4o's 0.323) and yields more intuitive, safer instructions. Code and benchmark are available at \href{https://github.com/YiyiyiZhao/NIG4VI}{https://github.com/YiyiyiZhao/NIG4VI}.
中文: 本研究提出LaF-GRPO方法,通过大语言模型模拟视障用户反馈来优化视觉语言模型,生成精确的实时导航指引,并在NIG4VI基准测试中验证了该方法在实用性和安全性指标上的显著提升。
English: This study introduces LaF-GRPO, a method using LLM-simulated visually impaired user feedback to enhance vision-language models for generating precise, in-situ navigation instructions, validated by the new NIG4VI benchmark showing significant improvements in usability and safety metrics.

Authors:Dan Oneata, Leanne Nortje, Yevgen Matusevych, Herman Kamper
Title: The mutual exclusivity bias of bilingual visually grounded speech models
Abstract:
Mutual exclusivity (ME) is a strategy where a novel word is associated with a novel object rather than a familiar one, facilitating language learning in children. Recent work has found an ME bias in a visually grounded speech (VGS) model trained on English speech with paired images. But ME has also been studied in bilingual children, who may employ it less due to cross-lingual ambiguity. We explore this pattern computationally using bilingual VGS models trained on combinations of English, French, and Dutch. We find that bilingual models generally exhibit a weaker ME bias than monolingual models, though exceptions exist. Analyses show that the combined visual embeddings of bilingual models have a smaller variance for familiar data, partly explaining the increase in confusion between novel and familiar concepts. We also provide new insights into why the ME bias exists in VGS models in the first place. Code and data: https://github.com/danoneata/me-vgs
中文: 双语视觉语音模型比单语模型表现出更弱的互斥性偏好,部分原因是熟悉数据的视觉嵌入方差减小,导致新概念与熟悉概念之间的混淆增加。
English: Bilingual visually grounded speech models exhibit a weaker mutual exclusivity bias than monolingual ones, partly due to reduced variance in visual embeddings for familiar data, which increases confusion between novel and familiar concepts.

Authors:An Quang Tang, Xiuzhen Zhang, Minh Ngoc Dinh, Zhuang Li
Title: QQSUM: A Novel Task and Model of Quantitative Query-Focused Summarization for Review-based Product Question Answering
Abstract:
Review-based Product Question Answering (PQA) allows e-commerce platforms to automatically address customer queries by leveraging insights from user reviews. However, existing PQA systems generate answers with only a single perspective, failing to capture the diversity of customer opinions. In this paper we introduce a novel task Quantitative Query-Focused Summarization (QQSUM), which aims to summarize diverse customer opinions into representative Key Points (KPs) and quantify their prevalence to effectively answer user queries. While Retrieval-Augmented Generation (RAG) shows promise for PQA, its generated answers still fall short of capturing the full diversity of viewpoints. To tackle this challenge, our model QQSUM-RAG, which extends RAG, employs few-shot learning to jointly train a KP-oriented retriever and a KP summary generator, enabling KP-based summaries that capture diverse and representative opinions. Experimental results demonstrate that QQSUM-RAG achieves superior performance compared to state-of-the-art RAG baselines in both textual quality and quantification accuracy of opinions. Our source code is available at: https://github.com/antangrocket1312/QQSUMM
中文:本文提出了QQSUM这一新任务,通过将多样化的用户意见总结为代表性关键点并量化其普遍性,以改进基于评论的产品问答系统,所提出的QQSUM-RAG模型在文本质量和意见量化准确性方面均优于现有方法。
English: This paper introduces QQSUM, a novel task that enhances Review-based Product Question Answering by summarizing diverse customer opinions into representative Key Points and quantifying their prevalence, with the proposed QQSUM-RAG model outperforming existing methods in both textual quality and quantification accuracy.

Authors:Alex Laitenberger, Christopher D. Manning, Nelson F. Liu
Title: Stronger Baselines for Retrieval-Augmented Generation with Long-Context Language Models
Abstract:
With the rise of long-context language models (LMs) capable of processing tens of thousands of tokens in a single pass, do multi-stage retrieval-augmented generation (RAG) pipelines still offer measurable benefits over simpler, single-stage approaches? To address this question, we conduct a controlled evaluation for QA tasks under systematically scaled token budgets, comparing two recent multi-stage pipelines, ReadAgent and RAPTOR, against three baselines, including DOS RAG (Document's Original Structure RAG), a simple retrieve-then-read method that preserves original passage order. Despite its straightforward design, DOS RAG consistently matches or outperforms more intricate methods on multiple long-context QA benchmarks. We recommend establishing DOS RAG as a simple yet strong baseline for future RAG evaluations, pairing it with emerging embedding and language models to assess trade-offs between complexity and effectiveness as model capabilities evolve.
中文:尽管长上下文语言模型兴起,但简单的DOS RAG方法在问答任务中始终与复杂多阶段流程持平或更优,建议将其作为未来检索增强生成评估的强基准。
English: Despite the emergence of long-context language models, the simple DOS RAG method consistently matches or surpasses complex multi-stage pipelines in QA tasks, establishing it as a strong baseline for future evaluations.
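
The DOS RAG baseline is simple enough to sketch in full: retrieve top-k passages, then restore their original document order before reading. The scoring function below is a stand-in for a real retriever.

```python
# Hedged sketch of the DOS RAG baseline: rank passages by relevance, keep
# top-k, then re-sort the kept passages into their original document order
# before concatenating them into the prompt.

def dos_rag_context(query: str, passages: list[str], k: int = 3) -> str:
    def score(p: str) -> float:                      # stand-in retriever score
        q = set(query.lower().split())
        return len(q & set(p.lower().split()))
    # Rank by relevance, keep top-k indices...
    top = sorted(range(len(passages)), key=lambda i: score(passages[i]),
                 reverse=True)[:k]
    # ...but restore the document's original order before reading.
    return "\n".join(passages[i] for i in sorted(top))

doc = ["Intro: penguins are birds.", "Habitat: Antarctica.",
       "Diet: mostly fish and krill.", "Trivia: they cannot fly."]
print(dos_rag_context("what do penguins eat fish", doc, k=2))
```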

Authors:Takeshi Saga, Catherine Pelachaud
Title: Voice Activity Projection Model with Multimodal Encoders
Abstract:
Turn-taking management is crucial for any social interaction. Still, it is challenging to model human-machine interaction due to the complexity of the social context and its multimodal nature. Unlike conventional systems based on silence duration, previous existing voice activity projection (VAP) models successfully utilized a unified representation of turn-taking behaviors as prediction targets, which improved turn-taking prediction performance. Recently, a multimodal VAP model outperformed the previous state-of-the-art model by a significant margin. In this paper, we propose a multimodal model enhanced with pre-trained audio and face encoders to improve performance by capturing subtle expressions. Our model performed competitively, and in some cases, even better than state-of-the-art models on turn-taking metrics. All the source codes and pretrained models are available at https://github.com/sagatake/VAPwithAudioFaceEncoders.
Chinese: 本文提出了一种结合预训练音频和面部编码器的多模态模型,通过捕捉细微表情来提升对话轮次预测性能,在关键指标上表现优异,甚至超越了现有最优模型。
English: This paper introduces a multimodal model enhanced with pre-trained audio and face encoders that captures subtle expressions to improve turn-taking prediction, performing competitively or even surpassing state-of-the-art models on key metrics.

Authors:Chiwei Zhu, Benfeng Xu, Xiaorui Wang, Zhendong Mao
Title: From Real to Synthetic: Synthesizing Millions of Diversified and Complicated User Instructions with Attributed Grounding
Abstract:
The pursuit of diverse, complex, and large-scale instruction data is crucial for automatically aligning large language models (LLMs). While there are methods capable of generating synthetic instructions at scale, they either suffer from limited grounding sources, leading to a narrow distribution, or rely on trivial extensions that fail to produce meaningful trajectories in terms of complexity. In contrast, instructions that benefit efficient alignment are typically crafted with cognitive insights and grounded in real-world use cases. In this paper, we synthesize such instructions using attributed grounding, which involves 1) a top-down attribution process that grounds a selective set of real instructions to situated users, and 2) a bottom-up synthesis process that leverages web documents to first generate a situation, then a meaningful instruction. This framework allows us to harvest diverse and complex instructions at scale, utilizing the vast range of web documents. Specifically, we construct a dataset of 1 million instructions, called SynthQuestions, and demonstrate that models trained on it achieve leading performance on several common benchmarks, with improvements that continually scale with more web corpora. Data, models and codes will be available at https://github.com/Ignoramus0817/SynthQuestions.
中文: 本文提出SynthQuestions数据集,通过结合自上而下的用户归因和自下而上的网络文档合成的新颖归因基础框架,生成百万条指令以改进大语言模型对齐,在多个基准测试中取得了领先性能。
English: This paper introduces SynthQuestions, a dataset of one million instructions generated through a novel attributed grounding framework that combines top-down user-based attribution with bottom-up synthesis from web documents to enhance large language model alignment, achieving state-of-the-art performance on benchmarks.

Authors:Junnan Zhu, Jingyi Wang, Bohan Yu, Xiaoyu Wu, Junbo Li, Lei Wang, Nan Xu
Title: TableEval: A Real-World Benchmark for Complex, Multilingual, and Multi-Structured Table Question Answering
Abstract:
LLMs have shown impressive progress in natural language processing. However, they still face significant challenges in TableQA, where real-world complexities such as diverse table structures, multilingual data, and domain-specific reasoning are crucial. Existing TableQA benchmarks are often limited by their focus on simple flat tables and suffer from data leakage. Furthermore, most benchmarks are monolingual and fail to capture the cross-lingual and cross-domain variability in practical applications. To address these limitations, we introduce TableEval, a new benchmark designed to evaluate LLMs on realistic TableQA tasks. Specifically, TableEval includes tables with various structures (such as concise, hierarchical, and nested tables) collected from four domains (including government, finance, academia, and industry reports). In addition, TableEval features cross-lingual scenarios with tables in Simplified Chinese, Traditional Chinese, and English. To minimize the risk of data leakage, we collect all data from recent real-world documents. Considering that existing TableQA metrics fail to capture semantic accuracy, we further propose SEAT, a new evaluation framework that assesses the alignment between model responses and reference answers at the sub-question level. Experimental results have shown that SEAT achieves high agreement with human judgment. Extensive experiments on TableEval reveal critical gaps in the ability of state-of-the-art LLMs to handle these complex, real-world TableQA tasks, offering insights for future improvements. We make our dataset available here: https://github.com/wenge-research/TableEval.
Chinese: 作者提出了TableEval基准,通过整合多样化表格结构、多语言数据和真实领域内容来弥补现有TableQA系统的不足,并开发了与人类判断高度一致的SEAT评估框架。
English: The authors introduce TableEval, a comprehensive benchmark addressing limitations in existing TableQA systems by incorporating diverse table structures, multilingual data, and real-world domains, along with a new evaluation framework SEAT that better aligns with human judgment.

Authors:Junqi Gao, Xiang Zou, YIng Ai, Dong Li, Yichen Niu, Biqing Qi, Jianxing Liu
Title: Graph Counselor: Adaptive Graph Exploration via Multi-Agent Synergy to Enhance LLM Reasoning
Abstract:
Graph Retrieval Augmented Generation (GraphRAG) effectively enhances external knowledge integration capabilities by explicitly modeling knowledge relationships, thereby improving the factual accuracy and generation quality of Large Language Models (LLMs) in specialized domains. However, existing methods suffer from two inherent limitations: 1) Inefficient Information Aggregation: They rely on a single agent and fixed iterative patterns, making it difficult to adaptively capture multi-level textual, structural, and degree information within graph data. 2) Rigid Reasoning Mechanism: They employ preset reasoning schemes, which cannot dynamically adjust reasoning depth nor achieve precise semantic correction. To overcome these limitations, we propose Graph Counselor, a GraphRAG method based on multi-agent collaboration. This method uses the Adaptive Graph Information Extraction Module (AGIEM), where Planning, Thought, and Execution Agents work together to precisely model complex graph structures and dynamically adjust information extraction strategies, addressing the challenges of multi-level dependency modeling and adaptive reasoning depth. Additionally, the Self-Reflection with Multiple Perspectives (SR) module improves the accuracy and semantic consistency of reasoning results through self-reflection and backward reasoning mechanisms. Experiments demonstrate that Graph Counselor outperforms existing methods in multiple graph reasoning tasks, exhibiting higher reasoning accuracy and generalization ability. Our code is available at https://github.com/gjq100/Graph-Counselor.git.
中文摘要:Graph Counselor通过多智能体协作和自反思机制,克服了现有GraphRAG方法在信息聚合和推理机制上的局限性,实现了自适应信息提取和动态推理深度调整,在图推理任务中表现出更优性能。
English Summary: Graph Counselor overcomes the limitations of existing GraphRAG methods by employing multi-agent collaboration and self-reflection mechanisms to achieve adaptive information extraction and dynamic reasoning depth adjustment, demonstrating superior performance in graph reasoning tasks.

Authors:Junyi Chen, Shihao Bai, Zaijun Wang, Siyu Wu, Chuheng Du, Hailong Yang, Ruihao Gong, Shengzhong Liu, Fan Wu, Guihai Chen
Title: Pre$^3$: Enabling Deterministic Pushdown Automata for Faster Structured LLM Generation
Abstract:
A wide range of LLM applications demand efficient structured generation, particularly for LR(1) grammars, to produce outputs in specified formats (e.g., JSON). Existing methods primarily parse LR(1) grammars into a pushdown automaton (PDA), leading to runtime execution overhead for context-dependent token processing, which is especially inefficient under large inference batches. To address these issues, we propose Pre$^3$, which exploits deterministic pushdown automata (DPDA) to optimize constrained LLM decoding efficiency. First, by precomputing prefix-conditioned edges during preprocessing, Pre$^3$ enables ahead-of-time edge analysis and thus makes parallel transition processing possible. Second, by leveraging the prefix-conditioned edges, Pre$^3$ introduces a novel approach that transforms LR(1) transition graphs into DPDA, eliminating the need for runtime path exploration and achieving edge transitions with minimal overhead. Pre$^3$ can be seamlessly integrated into standard LLM inference frameworks, reducing time per output token (TPOT) by up to 40% and increasing throughput by up to 36% in our experiments. Our code is available at https://github.com/ModelTC/lightllm.
中文: 提出的Pre$^3$方法通过将LR(1)文法转换为具有预计算前缀条件边的确定性下推自动机,优化了大型语言模型的解码效率,实现并行转换并降低运行时开销,使令牌生成速度提升最高达40%,吞吐量提高最高达36%。
English: The proposed Pre$^3$ method optimizes LLM decoding efficiency by transforming LR(1) grammars into deterministic pushdown automata with precomputed prefix-conditioned edges, enabling parallel transitions and reducing runtime overhead to achieve up to 40% faster token generation and 36% higher throughput.
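
Pre$^3$'s full LR(1)-to-DPDA construction is beyond a short sketch, but the underlying principle, precompute deterministic transitions offline so that runtime token masking becomes a table lookup rather than path exploration, can be illustrated with a toy automaton. Everything below is a simplified illustration (a plain DFA for the pattern "a+b", not a DPDA), not the paper's algorithm.

```python
# Greatly simplified illustration of grammar-constrained decoding with
# precomputed transitions: offline, build a table (state, token) -> next
# state; at decode time the legal-token mask is a cheap lookup.

VOCAB = ["a", "b"]
# Offline: deterministic transition table for the toy pattern "a+b".
TRANS = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2}   # accept state: 2

def allowed_tokens(state: int) -> list[str]:
    return [t for t in VOCAB if (state, t) in TRANS]   # lookup, no search

def constrained_decode(pick) -> str:
    state, out = 0, []
    while state != 2:
        mask = allowed_tokens(state)       # the "precomputed edges" in action
        tok = pick(mask)                   # model chooses among legal tokens only
        out.append(tok)
        state = TRANS[(state, tok)]
    return "".join(out)

print(constrained_decode(lambda mask: mask[-1]))   # -> "ab"
```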

Authors:Mingxuan Xia, Haobo Wang, Yixuan Li, Zewei Yu, Jindong Wang, Junbo Zhao, Runze Wu
Title: Prompt Candidates, then Distill: A Teacher-Student Framework for LLM-driven Data Annotation
Abstract:
Recently, Large Language Models (LLMs) have demonstrated significant potential for data annotation, markedly reducing the labor costs associated with downstream applications. However, existing methods mostly adopt an aggressive strategy by prompting the LLM to determine a single gold label for each unlabeled sample. Due to the inherent uncertainty within LLMs, they often produce incorrect labels for difficult samples, severely compromising the data quality for downstream applications. Motivated by ambiguity aversion in human behaviors, we propose a novel candidate annotation paradigm wherein large language models are encouraged to output all possible labels when incurring uncertainty. To ensure unique labels are provided for downstream tasks, we develop a teacher-student framework, CanDist, that distills candidate annotations with a Small Language Model (SLM). We further provide a rigorous justification demonstrating that distilling candidate annotations from the teacher LLM offers superior theoretical guarantees compared to directly using single annotations. Extensive experiments across six text classification tasks validate the effectiveness of our proposed method. The source code is available at https://github.com/MingxuanXia/CanDist.
中文摘要:本文提出一种候选标注新范式,通过让大语言模型对不确定样本输出所有可能的标签,并设计名为CanDist的师生框架,利用小语言模型蒸馏这些候选标注,从而为下游任务提供更高质量的数据保障。
English Summary: This paper introduces a candidate annotation paradigm that leverages LLMs to generate multiple possible labels for uncertain samples, coupled with a teacher-student framework called CanDist that distills these annotations using a smaller language model to ensure data quality for downstream tasks.
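
The candidate-annotation idea can be sketched concretely: on uncertain samples, prompt the LLM for every label it cannot rule out, then train the student against the candidate set rather than a single, possibly wrong, guess. The labels, stand-in LLM, and uniform soft target below are my own illustration; CanDist's actual distillation objective is more refined.

```python
# Hedged sketch of candidate annotation + a soft distillation target.

LABELS = ["sports", "politics", "technology"]

def llm_candidate_labels(text: str) -> list[str]:
    # Stand-in: a real system prompts the LLM to list every label it cannot
    # rule out, e.g. "Output all labels that could apply, comma-separated."
    if "election" in text and "app" in text:
        return ["politics", "technology"]     # ambiguous -> multiple candidates
    return ["sports"]

def candidate_target(candidates: list[str]) -> list[float]:
    # Uniform distribution over candidates as a soft target for the student.
    return [1.0 / len(candidates) if lab in candidates else 0.0
            for lab in LABELS]

print(candidate_target(llm_candidate_labels(
    "New election app lets voters compare candidates")))
# -> [0.0, 0.5, 0.5]: the student sees the ambiguity instead of a noisy guess.
```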

Authors:Fabian Karl, Ansgar Scherp
Title: CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents
Abstract:
Publication databases rely on accurate metadata extraction from diverse web sources, yet variations in web layouts and data formats present challenges for metadata providers. This paper introduces CRAWLDoc, a new method for contextual ranking of linked web documents. Starting with a publication's URL, such as a digital object identifier, CRAWLDoc retrieves the landing page and all linked web resources, including PDFs, ORCID profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation. For evaluating CRAWLDoc, we have created a new, manually labeled dataset of 600 publications from six top publishers in computer science. Our method CRAWLDoc demonstrates a robust and layout-independent ranking of relevant documents across publishers and data formats. It lays the foundation for improved metadata extraction from web documents with various layouts and formats. Our source code and dataset can be accessed at https://github.com/FKarl/CRAWLDoc.
中文: CRAWLDoc是一种新颖的方法,通过出版物URL对链接网页文档进行上下文排序,实现了跨不同格式和出版商的、独立于网页布局的稳健元数据提取。
English: CRAWLDoc is a novel method that contextually ranks linked web documents from publication URLs, enabling robust and layout-independent metadata extraction across diverse formats and publishers.

Authors:Pei-Yun Lin, Yen-lung Tsai
Title: ScoreRAG: A Retrieval-Augmented Generation Framework with Consistency-Relevance Scoring and Structured Summarization for News Generation
Abstract:
This research introduces ScoreRAG, an approach to enhance the quality of automated news generation. Despite advancements in Natural Language Processing and large language models, current news generation methods often struggle with hallucinations, factual inconsistencies, and lack of domain-specific expertise when producing news articles. ScoreRAG addresses these challenges through a multi-stage framework combining retrieval-augmented generation, consistency relevance evaluation, and structured summarization. The system first retrieves relevant news documents from a vector database, maps them to complete news items, and assigns consistency relevance scores based on large language model evaluations. These documents are then reranked according to relevance, with low-quality items filtered out. The framework proceeds to generate graded summaries based on relevance scores, which guide the large language model in producing complete news articles following professional journalistic standards. Through this methodical approach, ScoreRAG aims to significantly improve the accuracy, coherence, informativeness, and professionalism of generated news articles while maintaining stability and consistency throughout the generation process. The code and demo are available at: https://github.com/peiyun2260/ScoreRAG.
Chinese: ScoreRAG通过结合检索增强生成、相关性评分和结构化摘要的多阶段框架,旨在减少自动新闻生成中的幻觉问题,提升准确性、连贯性和专业性。
English: ScoreRAG is a multi-stage framework that enhances automated news generation by integrating retrieval-augmented generation, relevance scoring, and structured summarization to reduce hallucinations and improve accuracy, coherence, and professionalism.

Authors:Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng
Title: AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism
Abstract:
Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential token generation process, where each token must be generated before the next can be processed. This sequential dependency restricts the ability to fully leverage modern hardware's parallel processing capabilities. Existing methods like speculative decoding and layer skipping offer potential speedups but have notable drawbacks: speculative decoding relies on an auxiliary "drafter" model, which can be challenging to acquire and increases memory overhead, while layer skipping may introduce discrepancies in the outputs due to the missing key-value cache at skipped layers. In this work, we propose AdaDecode, which accelerates LLM decoding without requiring auxiliary models or changes to the original model parameters, while ensuring output consistency. AdaDecode leverages the insight that many tokens can accurately be generated at intermediate layers, as further layers often do not significantly alter predictions once the model reaches a certain confidence. By adaptively generating tokens at intermediate layers when confidence is high, AdaDecode enables the next token's computation to begin immediately. The remaining layer computations for early-predicted tokens are deferred and executed in parallel with subsequent tokens when needed, maximizing hardware utilization and reducing decoding latency. A final verification step ensures that early predictions match the results of standard autoregressive decoding, preserving output parity. Experiments across diverse generation tasks show that AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup, while guaranteeing output parity with standard autoregressive decoding.
中文摘要:AdaDecode通过在高置信度时自适应地在中间层生成令牌,实现并行计算,在保证与标准自回归解码输出一致性的同时,将解码吞吐量最高提升1.73倍。
English Summary: AdaDecode accelerates LLM decoding by adaptively generating tokens at intermediate layers when confidence is high, enabling parallel computation and achieving up to 1.73x speedup while maintaining output consistency with standard autoregressive decoding.
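
The early-exit core of the idea can be sketched with a toy stack of layers sharing one LM head; the deferred completion of skipped layers and the final verification step are noted only in comments, and every shape, weight, and threshold below is illustrative.

```python
# Toy sketch of confidence-based early exit at intermediate layers.
import numpy as np

rng = np.random.default_rng(0)
N_LAYERS, HIDDEN, VOCAB, TAU = 8, 16, 50, 0.9
layers = [rng.normal(size=(HIDDEN, HIDDEN)) / HIDDEN**0.5 for _ in range(N_LAYERS)]
lm_head = rng.normal(size=(HIDDEN, VOCAB))   # shared LM head applied at every layer

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_one(h):
    for i, W in enumerate(layers):
        h = np.tanh(h @ W)
        probs = softmax(h @ lm_head)
        if probs.max() >= TAU and i < N_LAYERS - 1:
            # Confident enough: emit the token now. The remaining layers would
            # be computed later, in parallel with the next token, and a final
            # verification would check the early prediction.
            return int(probs.argmax()), i + 1
    return int(probs.argmax()), N_LAYERS

token, exit_layer = decode_one(rng.normal(size=HIDDEN))
print(f"token {token} emitted after {exit_layer} of {N_LAYERS} layers")
```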

Authors:Jie Sun, Junkang Wu, Jiancan Wu, Zhibo Zhu, Xingyu Lu, Jun Zhou, Lintao Ma, Xiang Wang
Title: Robust Preference Optimization via Dynamic Target Margins
Abstract:
The alignment of Large Language Models (LLMs) is crucial for ensuring their safety and reliability in practical applications. Direct Preference Optimization (DPO) has emerged as an efficient method that directly optimizes models using preference pairs, significantly reducing resource demands. However, the effectiveness of DPO heavily depends on the data quality, which is frequently compromised by noise. In this work, we propose $γ$-PO, a dynamic target margin preference optimization algorithm that adjusts reward margins at the pairwise level. By introducing instance-specific margin calibration, $γ$-PO strategically prioritizes high-confidence pairs (those demonstrating higher reward margins) while suppressing potential noise from ambiguous pairs. Moreover, $γ$-PO is a plug-and-play method, compatible with variants of DPO that rely on the reward margin between preference pairs. Across benchmarks such as AlpacaEval2 and Arena-Hard, $γ$-PO achieves an average 4.4\% improvement over other baselines, setting new benchmarks for state-of-the-art performance. Additionally, $γ$-PO requires minimal code changes and has a negligible impact on training efficiency, making it a robust solution for enhancing LLM alignment. Our codes are available at \href{https://github.com/sunjie279/gammaPO}{https://github.com/sunjie279/gammaPO}.
中文: 本文提出$γ$-PO算法,通过动态调整奖励边界来优先处理高置信度数据对并抑制噪声,在保持训练效率的同时将模型对齐性能平均提升4.4%,为LLM对齐提供了即插即用的强化方案。
English: This paper introduces $γ$-PO, a dynamic target margin preference optimization algorithm that enhances LLM alignment by calibrating reward margins to prioritize high-confidence pairs and suppress noise, achieving a 4.4% average improvement across benchmarks with minimal efficiency impact.
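
One plausible reading of a per-pair target margin on top of a DPO-style loss is sketched below; the batch-softmax margin schedule is a hypothetical stand-in, not the paper's exact $γ$ formula.

```python
# Sketch of DPO with a per-pair target margin gamma_i (assumed schedule:
# a batch softmax over detached reward margins, scaled to mean gamma_scale).
import torch
import torch.nn.functional as F

def gamma_po_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                  beta=0.1, gamma_scale=1.0):
    """logp_*: (B,) policy log-probs of chosen/rejected; ref_logp_*: reference."""
    margins = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Larger observed margins -> higher confidence -> a larger target margin;
    # ambiguous (low-margin, possibly noisy) pairs get a smaller one.
    gamma = gamma_scale * torch.softmax(margins.detach(), dim=0) * margins.numel()
    return -F.logsigmoid(margins - gamma).mean()

loss = gamma_po_loss(torch.tensor([-3.0, -2.5]), torch.tensor([-4.0, -2.6]),
                     torch.tensor([-3.2, -2.7]), torch.tensor([-3.9, -2.8]))
print(loss)
```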

Authors:Zhuohao Yu, Jiali Zeng, Weizheng Gu, Yidong Wang, Jindong Wang, Fandong Meng, Jie Zhou, Yue Zhang, Shikun Zhang, Wei Ye
Title: RewardAnything: Generalizable Principle-Following Reward Models
Abstract:
Reward Models, essential for guiding Large Language Model optimization, are typically trained on fixed preference datasets, resulting in rigid alignment to single, implicit preference distributions. This prevents adaptation to diverse real-world needs, from conciseness in one task to detailed explanations in another. The standard practice of collecting task-specific preference data and retraining reward models is resource-intensive, often produces biased rewards, and limits practical application. We introduce generalizable, principle-following reward models. We propose that RMs should understand and adhere to dynamically provided natural language specifications of reward principles, similar to instruction-following in LLMs. To measure this capability, we develop RABench, a comprehensive benchmark for RMs focusing on generalization across diverse principles. Evaluations on RABench reveal poor generalization of current RMs. As a solution, we present RewardAnything, a novel RM designed and trained to explicitly follow natural language principles. We achieve SotA performance with RewardAnything on traditional RM benchmarks simply by specifying a well-defined principle, and results on RABench show we excel in adapting to novel principles without retraining. Furthermore, RewardAnything integrates seamlessly with existing RLHF methods, and we show through a case study how to automatically and efficiently align LLMs with only natural language principles.
Chinese Summary: 传统奖励模型依赖固定数据集,难以适应多样任务需求,而新型RewardAnything模型通过动态遵循自然语言原则,无需重新训练即可实现卓越的泛化能力。
English Summary: Reward Models traditionally rely on fixed datasets, limiting their adaptability to varying real-world tasks, but the new RewardAnything model dynamically follows natural language principles, achieving superior generalization without retraining.

Authors:Ayuto Tsutsumi, Yuu Jinnai
Title: Do Large Language Models Know Folktales? A Case Study of Yokai in Japanese Folktales
Abstract:
Although Large Language Models (LLMs) have demonstrated strong language understanding and generation abilities across various languages, their cultural knowledge is often limited to English-speaking communities, which can marginalize the cultures of non-English communities. To address the problem, evaluation of the cultural awareness of LLMs and methods to develop culturally aware LLMs have been investigated. In this study, we focus on evaluating knowledge of folktales, a key medium for conveying and circulating culture. In particular, we focus on Japanese folktales, specifically on knowledge of Yokai. Yokai are supernatural creatures originating from Japanese folktales that continue to be popular motifs in art and entertainment today. Yokai have long served as a medium for cultural expression, making them an ideal subject for assessing the cultural awareness of LLMs. We introduce YokaiEval, a benchmark dataset consisting of 809 multiple-choice questions (each with four options) designed to probe knowledge about Yokai. We evaluate the performance of 31 Japanese and multilingual LLMs on this dataset. The results show that models trained with Japanese language resources achieve higher accuracy than English-centric models, with those that underwent continued pretraining in Japanese, particularly those based on Llama-3, performing especially well. The code and dataset are available at https://github.com/CyberAgentAILab/YokaiEval.
中文: 大语言模型常局限于英语文化知识,因此本研究推出YokaiEval基准,通过809道日本妖怪民俗问题测试发现,使用日语资源训练的模型表现优于以英语为中心的模型。
English: Large Language Models often lack cultural knowledge beyond English-speaking communities, so this study introduces YokaiEval, a benchmark of 809 questions on Japanese Yokai folklore, revealing that models trained with Japanese resources outperform English-centric ones.

Authors:Zhanhui Zhou, Lingjie Chen, Chao Yang, Chaochao Lu
Title: VLMs Can Aggregate Scattered Training Patches
Abstract:
One way to mitigate risks in vision-language models (VLMs) is to remove dangerous samples in their training data. However, such data moderation can be easily bypassed when harmful images are split into small, benign-looking patches, scattered across many training samples. VLMs may then learn to piece these fragments together during training and generate harmful responses at inference, either from full images or text references. For instance, if trained on image patches from a bloody scene paired with the description "safe," VLMs may later describe the full image, or a text reference to the scene, as "safe." We define the core ability of VLMs enabling this attack as $\textit{visual stitching}$: the ability to integrate visual information spread across multiple training samples that share the same textual descriptions. In our work, we first demonstrate visual stitching abilities in common open-source VLMs on three datasets where each image is labeled with a unique synthetic ID: we split each $(\texttt{image}, \texttt{ID})$ pair into $\{(\texttt{patch}, \texttt{ID})\}$ pairs at different granularity for finetuning, and we find that tuned models can verbalize the correct IDs from full images or text references. Building on this, we simulate the adversarial data poisoning scenario mentioned above by using patches from dangerous images and replacing IDs with text descriptions like "safe" or "unsafe", demonstrating how harmful content can evade moderation in patches and later be reconstructed through visual stitching, posing serious VLM safety risks. Code is available at https://github.com/ZHZisZZ/visual-stitching.
中文总结:视觉语言模型通过视觉拼接能力,可从分散的良性图像碎片中重构有害内容,从而规避训练数据审核并在推理时生成危险回应。
English Summary: Vision-language models can reconstruct harmful content from seemingly safe image patches during training, exploiting their visual stitching ability to bypass data moderation and generate dangerous responses.
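
The data construction behind the attack is easy to illustrate: split each image into a grid of patches and pair every patch with the full image's textual ID or description. A minimal sketch follows (grid size and helper names are arbitrary; the pairs would feed a VLM finetuning set).

```python
# Sketch of the patch-splitting used to probe visual stitching: each
# (image, ID) pair becomes many (patch, ID) pairs for finetuning.
from PIL import Image

def split_into_patches(img: Image.Image, grid: int = 4):
    w, h = img.size
    pw, ph = w // grid, h // grid
    return [img.crop((c * pw, r * ph, (c + 1) * pw, (r + 1) * ph))
            for r in range(grid) for c in range(grid)]

def make_finetune_pairs(img_path: str, text_id: str, grid: int = 4):
    # Every patch shares the full image's textual ID/description, which is
    # what lets the model re-associate the fragments after training.
    return [(patch, text_id)
            for patch in split_into_patches(Image.open(img_path), grid)]
```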

Authors:Viktor Hangya, Fabian Küch, Darina Gold
Title: From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
Abstract:
Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate, essential capabilities like reasoning and code generation rely on the more time-consuming NLG (token-by-token generation) format. In this work, our aim is to decrease the computational burden of NLG benchmarks in order to enable monitoring crucial LLM capabilities during model training. We reformulate generative tasks into computationally cheaper NLU alternatives. We test the performance correlation between the original and reformulated tasks using 8 LMs of various sizes and 4 capabilities: mathematical reasoning, code generation, factual knowledge and reading comprehension. Our results show a strong correlation between task formats, supporting capability assessment via cheaper alternatives and achieving over 35x average reduction in evaluation time. Our project is available at: https://github.com/Fraunhofer-IIS/EvalShortcut
Chinese: 本研究通过将生成式任务转化为成本更低的选择题形式,显著降低了大型语言模型评估的计算负担,在四项关键能力上实现了强性能相关性,并使评估速度平均提升超过35倍。
English: This study introduces a method to reduce the computational cost of evaluating large language models by converting generative tasks into cheaper multiple-choice formats, achieving strong performance correlation and over 35x faster evaluation across four key capabilities.
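
The reformulation can be approximated with any causal LM by scoring fixed answer options by log-likelihood instead of generating tokens; the sketch below uses the GPT-2 checkpoint purely as an illustration, not the models evaluated in the paper.

```python
# Sketch of the NLU-style shortcut: score each fixed answer option by its
# log-likelihood under the model instead of generating token by token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def option_logprob(prompt: str, option: str) -> float:
    # Assumes the prompt/option token boundary is preserved, which holds for
    # options with a leading space under GPT-2's BPE.
    ids = tok(prompt + option, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    logits = model(ids).logits[0, :-1]        # position t predicts token t+1
    targets = ids[0, 1:]
    logp = torch.log_softmax(logits, -1).gather(1, targets[:, None]).squeeze(1)
    return logp[n_prompt - 1:].sum().item()   # sum over option tokens only

question = "Q: What is 2 + 3? A:"
print(max([" 5", " 4", " 6"], key=lambda o: option_logprob(question, o)))
```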

Authors:Core Team, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia
Title: MiMo-VL Technical Report
Abstract:
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
中文:MiMo-VL-7B模型在视觉语言任务中实现顶尖性能,通过四阶段预训练和混合强化学习方法在多项基准测试中超越竞争对手,并公开提供了完整模型和评估套件。
English: The MiMo-VL-7B models achieve state-of-the-art performance in vision-language tasks, outperforming competitors across multiple benchmarks through a four-stage pre-training and mixed reinforcement learning approach, with full model and evaluation suite publicly released.

Authors:Langlin Huang, Chengsong Huang, Jixuan Leng, Di Huang, Jiaxin Huang
Title: POSS: Position Specialist Generates Better Draft for Speculative Decoding
Abstract:
Speculative decoding accelerates Large Language Model (LLM) inference by using a small draft model to predict multiple tokens, and a large target model to verify these tokens in parallel. Recent studies leverage the hidden state of the target model to enhance draft model prediction accuracy. However, existing methods suffer from the degrading quality of draft token predictions at later positions, due to error accumulation in draft-model-generated features. In this paper, we propose Position Specialists (PosS), which consist of multiple position-specialized draft layers to generate tokens at assigned position(s). Position specialists greatly improve token acceptance rate at later positions per drafting round, as each specialist only needs to focus on handling a certain level of draft model feature deviation. Experimental results on Llama-3-8B-Instruct and Llama-2-13B-chat across six datasets demonstrate that PosS effectively improves over baselines on average acceptance length and speed-up ratio. Our codebase is available at https://github.com/shrango/PosS.
中文摘要:提出的位置专家(PosS)方法通过为不同位置分配专门的草稿层,有效减少错误累积,显著提高了大语言模型推理中的令牌接受率和加速比。
English Summary: The proposed Position Specialists (PosS) method enhances speculative decoding by using specialized draft layers for different token positions, effectively reducing error accumulation and improving acceptance rates and inference speed in large language models.
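
A toy sketch of position-specialized drafting follows: one small head per assigned position, with a stand-in feature update in between. Real PosS specialists operate on target-model hidden states and their drafts are verified in parallel; all weights and shapes here are illustrative.

```python
# Toy sketch of position-specialized drafting: each draft position has its
# own specialist head, so later positions are handled by heads trained on
# the larger feature deviation typical at that depth of a drafting round.
import numpy as np

rng = np.random.default_rng(1)
HIDDEN, VOCAB, DRAFT_LEN = 16, 50, 4
specialists = [rng.normal(size=(HIDDEN, VOCAB)) for _ in range(DRAFT_LEN)]

def draft_round(h):
    tokens = []
    for pos in range(DRAFT_LEN):
        logits = h @ specialists[pos]                    # specialist for this position
        tokens.append(int(logits.argmax()))
        h = np.tanh(h + 0.1 * rng.normal(size=HIDDEN))   # stand-in feature update
    return tokens                                        # then verified in parallel

print(draft_round(rng.normal(size=HIDDEN)))
```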

Authors:Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He
Title: How Far Are We from Generating Missing Modalities with Foundation Models?
Abstract:
Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14\% and MER for missing text reconstruction by at least 10\% compared to baselines. Code is released at: https://github.com/Guanzhou-Ke/AFM2.
中文: 本研究提出了一种智能代理框架,通过动态挖掘丰富语义特征并采用自我优化机制,显著提升了缺失模态重建的精度和适应性,优于现有基础模型。
English: This study introduces an agentic framework that enhances missing modality reconstruction by dynamically mining rich semantic features and employing self-refinement, significantly improving accuracy and adaptability over existing foundation models.

Authors:Chong Li, Jiajun Zhang, Chengqing Zong
Title: TokAlign: Efficient Vocabulary Adaptation via Token Alignment
Abstract:
Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, the inefficiency of the tokenizer slows down the training and generation of the LLM. The mismatch in vocabulary also hinders deep knowledge transfer between LLMs like token-level distillation. To mitigate this gap, we propose an efficient method named TokAlign to replace the vocabulary of an LLM from the token co-occurrence view, and further transfer token-level knowledge between models. It first aligns the source vocabulary to the target one by learning a one-to-one mapping matrix for token IDs. Model parameters, including embeddings, are rearranged and progressively fine-tuned for the new vocabulary. Our method significantly improves multilingual text compression rates and vocabulary initialization for LLMs, decreasing the perplexity from $3.4\times 10^2$ for strong baseline methods to $1.2\times 10^2$ after initialization. Experimental results on models across multiple parameter scales demonstrate the effectiveness and generalization of TokAlign, which costs as few as 5k steps to restore the performance of the vanilla model. After unifying vocabularies between LLMs, token-level distillation can remarkably boost the base model (+4.4% over sentence-level distillation) at a cost of only 235M tokens.
中文摘要:TokAlign是一种通过词汇对齐和参数重组来优化大型语言模型词汇表的高效方法,显著提升了文本压缩率并促进了模型间的知识迁移。
English Summary: TokAlign is an efficient method that aligns vocabularies between Large Language Models by learning token mappings and rearranging parameters, significantly improving text compression and enabling effective token-level knowledge transfer.
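
The vocabulary-swap mechanics can be sketched as follows, assuming cosine similarity over toy token vectors in place of the paper's co-occurrence-based alignment: learn a one-to-one source-to-target mapping, then permute the embedding table before progressive fine-tuning.

```python
# Sketch of vocabulary swapping in the TokAlign spirit: estimate a one-to-one
# token mapping, then rearrange the embedding matrix for the new vocabulary.
import numpy as np

rng = np.random.default_rng(2)
V, D = 10, 8
src_vecs = rng.normal(size=(V, D))    # stand-ins for co-occurrence-based vectors
tgt_vecs = rng.normal(size=(V, D))
embeddings = rng.normal(size=(V, D))  # source-vocab embedding table of the LLM

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

sim = normalize(tgt_vecs) @ normalize(src_vecs).T   # (tgt, src) cosine similarity

mapping = np.full(V, -1)
used = set()
for t in np.argsort(-sim.max(axis=1)):   # most confident target tokens first
    for s in np.argsort(-sim[t]):        # best still-unused source token
        if s not in used:
            mapping[t] = s
            used.add(s)
            break

new_embeddings = embeddings[mapping]     # row t now holds its matched source row
```

After this initialization, the rearranged model would be progressively fine-tuned under the new vocabulary, as the abstract describes.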

Authors:Yuchen Guo, Zhicheng Dou, Huy H. Nguyen, Ching-Chun Chang, Saku Sugawara, Isao Echizen
Title: Measuring Human Involvement in AI-Generated Text: A Case Study on Academic Writing
Abstract:
Content creation has dramatically progressed with the rapid advancement of large language models like ChatGPT and Claude. While this progress has greatly enhanced various aspects of life and work, it has also negatively affected certain areas of society. A recent survey revealed that nearly 30% of college students use generative AI to help write academic papers and reports. Most countermeasures treat the detection of AI-generated text as a binary classification task and thus lack robustness. This approach overlooks human involvement in the generation of content even though human-machine collaboration is becoming mainstream. Besides generating entire texts, people may use machines to complete or revise texts. Such human involvement varies case by case, which makes binary classification a less than satisfactory approach. We refer to this situation as participation detection obfuscation. We propose using BERTScore as a metric to measure human involvement in the generation process and a multi-task RoBERTa-based regressor trained on a token classification task to address this problem. To evaluate the effectiveness of this approach, we simulated academic-based scenarios and created a continuous dataset reflecting various levels of human involvement. All of the existing detectors we examined failed to detect the level of human involvement on this dataset. Our method, however, succeeded (F1 score of 0.9423 and a regressor mean squared error of 0.004). Moreover, it demonstrated some generalizability across generative models. Our code is available at https://github.com/gyc-nii/CAS-CS-and-dual-head-detector
中文: 随着ChatGPT等大型语言模型的快速发展,AI在学术写作中的应用日益普遍,但现有的二元检测方法无法有效评估不同程度的人机协作,因此我们提出了一种基于BERTScore的新方法,能够准确测量人类参与度并取得显著成效。
English: The rapid advancement of large language models like ChatGPT has led to widespread use of AI in academic writing, but current binary detection methods fail to account for varying levels of human-machine collaboration, prompting the development of a novel BERTScore-based approach that successfully measures human involvement with high accuracy.
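
The underlying measurement is reproducible with the public bert-score package (pip install bert-score): compare a fully machine-generated draft against the final text, treating lower similarity as a sign of more human editing. The sentences below are invented, and the paper's multi-task regressor goes well beyond this raw score.

```python
# Sketch of the paper's premise: BERTScore between a machine-generated draft
# and the final text as a proxy for the degree of human involvement.
from bert_score import score

machine_draft = ["The experiment demonstrates a significant effect of the treatment."]
final_text = ["Our experiment shows the treatment clearly helps, though effects vary."]

P, R, F1 = score(final_text, machine_draft, lang="en")
print(f"BERTScore F1 = {F1.item():.3f}  (lower suggests more human editing)")
```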

Authors:Yi Xu, Ruining Yang, Yitian Zhang, Yizhou Wang, Jianglin Lu, Mingyuan Zhang, Lili Su, Yun Fu
Title: Trajectory Prediction Meets Large Language Models: A Survey
Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in integrating language-driven techniques into trajectory prediction. By leveraging their semantic and reasoning capabilities, LLMs are reshaping how autonomous systems perceive, model, and predict trajectories. This survey provides a comprehensive overview of this emerging field, categorizing recent work into five directions: (1) Trajectory prediction via language modeling paradigms, (2) Direct trajectory prediction with pretrained language models, (3) Language-guided scene understanding for trajectory prediction, (4) Language-driven data generation for trajectory prediction, (5) Language-based reasoning and interpretability for trajectory prediction. For each, we analyze representative methods, highlight core design choices, and identify open challenges. This survey bridges natural language processing and trajectory prediction, offering a unified perspective on how language can enrich trajectory prediction.
中文: 本综述探讨了将大型语言模型融入轨迹预测的研究进展,将近期工作归纳为五个方向,分析其方法、核心设计及挑战,旨在连接自然语言处理与轨迹预测领域,提供统一视角。
English: This survey explores the integration of large language models into trajectory prediction, categorizing recent advances into five key directions and analyzing their methods, design choices, and challenges to bridge natural language processing with trajectory prediction.

Authors:Zihui Ma, Lingyao Li, Juan Li, Wenyue Hua, Jingxiao Liu, Qingyuan Feng, Yuki Miura
Title: A Multimodal, Multilingual, and Multidimensional Pipeline for Fine-grained Crowdsourcing Earthquake Damage Evaluation
Abstract:
Rapid, fine-grained disaster damage assessment is essential for effective emergency response, yet remains challenging due to limited ground sensors and delays in official reporting. Social media provides a rich, real-time source of human-centric observations, but its multimodal and unstructured nature presents challenges for traditional analytical methods. In this study, we propose a structured Multimodal, Multilingual, and Multidimensional (3M) pipeline that leverages multimodal large language models (MLLMs) to assess disaster impacts. We evaluate three foundation models across two major earthquake events using both macro- and micro-level analyses. Results show that MLLMs effectively integrate image-text signals and demonstrate a strong correlation with ground-truth seismic data. However, performance varies with language, epicentral distance, and input modality. This work highlights the potential of MLLMs for disaster assessment and provides a foundation for future research in applying MLLMs to real-time crisis contexts. The code and data are released at: https://github.com/missa7481/EMNLP25_earthquake
中文: 本研究提出结构化3M流程,利用多模态大语言模型整合图文信号评估灾害影响,结果显示其与地震数据高度相关,但性能受语言、震中距离和输入模态影响。
English: This study introduces a structured 3M pipeline using multimodal large language models to effectively assess disaster impacts by integrating image-text signals, showing strong correlation with seismic data while noting performance variations based on language, distance, and modality.

Authors:Aldan Creo, Héctor Cerezo-Costas, Pedro Alonso-Doval, Maximiliano Hormazábal-Lagos
Title: Ask a Local: Detecting Hallucinations With Specialized Model Divergence
Abstract:
Hallucinations in large language models (LLMs) - instances where models generate plausible but factually incorrect information - present a significant challenge for AI. We introduce "Ask a Local", a novel hallucination detection method exploiting the intuition that specialized models exhibit greater surprise when encountering domain-specific inaccuracies. Our approach computes divergence between perplexity distributions of language-specialized models to identify potentially hallucinated spans. Our method is particularly well-suited for a multilingual context, as it naturally scales to multiple languages without the need for adaptation, relying on external data sources, or performing training. Moreover, we select computationally efficient models, providing a scalable solution that can be applied to a wide range of languages and domains. Our results on a human-annotated question-answer dataset spanning 14 languages demonstrate consistent performance across languages, with Intersection-over-Union (IoU) scores around 0.3 and comparable Spearman correlation values. Our model shows particularly strong performance on Italian and Catalan, with IoU scores of 0.42 and 0.38, respectively, while maintaining cross-lingual effectiveness without language-specific adaptations. We release our code and architecture to facilitate further research in multilingual hallucination detection.
中文: “Ask a Local”方法通过比较专业模型的困惑度分布来检测大语言模型中的幻觉,提供了一种无需调整或外部数据即可扩展的多语言解决方案。
English: The "Ask a Local" method detects hallucinations in LLMs by comparing perplexity distributions of specialized models, offering a scalable, multilingual solution without requiring adaptation or external data.

Authors:Guillermo Marco, Julio Gonzalo, Víctor Fresno
Title: The Reader is the Metric: How Textual Features and Reader Profiles Explain Conflicting Evaluations of AI Creative Writing
Abstract:
Recent studies comparing AI-generated and human-authored literary texts have produced conflicting results: some suggest AI already surpasses human quality, while others argue it still falls short. We start from the hypothesis that such divergences can be largely explained by genuine differences in how readers interpret and value literature, rather than by an intrinsic quality of the texts evaluated. Using five public datasets (1,471 stories, 101 annotators including critics, students, and lay readers), we (i) extract 17 reference-less textual features (e.g., coherence, emotional variance, average sentence length...); (ii) model individual reader preferences, deriving feature importance vectors that reflect their textual priorities; and (iii) analyze these vectors in a shared "preference space". Reader vectors cluster into two profiles: 'surface-focused readers' (mainly non-experts), who prioritize readability and textual richness; and 'holistic readers' (mainly experts), who value thematic development, rhetorical variety, and sentiment dynamics. Our results quantitatively explain how measurements of literary quality are a function of how text features align with each reader's preferences. These findings advocate for reader-sensitive evaluation frameworks in the field of creative text generation.
中文摘要:最新研究表明,关于AI与人类文学质量评价的矛盾源于读者偏好差异:表层导向型读者注重可读性与文本丰富性,而整体导向型读者更看重主题发展及情感动态。
English Summary: Recent research reveals that conflicting assessments of AI versus human literary quality stem from distinct reader preferences, with surface-focused readers valuing readability and textual richness, while holistic readers prioritize thematic depth and sentiment dynamics.

Authors:Christodoulos Constantinides, Dhaval Patel, Shuxin Lin, Claudio Guerrero, Sunil Dagajirao Patil, Jayant Kalagnanam
Title: FailureSensorIQ: A Multi-Choice QA Dataset for Understanding Sensor Relationships and Failure Modes
Abstract:
We introduce FailureSensorIQ, a novel Multi-Choice Question-Answering (MCQA) benchmarking system designed to assess the ability of Large Language Models (LLMs) to reason and understand complex, domain-specific scenarios in Industry 4.0. Unlike traditional QA benchmarks, our system focuses on multiple aspects of reasoning through failure modes, sensor data, and the relationships between them across various industrial assets. Through this work, we envision a paradigm shift where modeling decisions are not only data-driven using statistical tools like correlation analysis and significance tests, but also domain-driven by specialized LLMs which can reason about the key contributors and useful patterns that can be captured with feature engineering. We evaluate the industrial knowledge of over a dozen LLMs, including GPT-4, Llama, and Mistral, on FailureSensorIQ through several lenses: Perturbation-Uncertainty-Complexity analysis, an expert evaluation study, asset-specific knowledge-gap analysis, and a ReAct agent using external knowledge bases. Even though closed-source models with strong reasoning capabilities approach expert-level performance, the comprehensive benchmark reveals a significant drop in performance under perturbations, distractions, and inherent knowledge gaps in the models. We also provide a real-world case study of how LLMs can drive modeling decisions on 3 different failure prediction datasets related to various assets. We release: (a) an expert-curated MCQA dataset for various industrial assets, (b) the FailureSensorIQ benchmark and a Hugging Face leaderboard based on MCQA built from non-textual data found in ISO documents, and (c) LLMFeatureSelector, an LLM-based feature selection scikit-learn pipeline. The software is available at https://github.com/IBM/FailureSensorIQ.
中文摘要:FailureSensorIQ是一个专为评估大语言模型在工业4.0领域复杂推理能力而设计的创新多选问答基准系统,通过多维度分析揭示了现有模型在扰动响应和专业知识方面存在的显著不足。
English Summary: FailureSensorIQ is a specialized MCQA benchmark that evaluates LLMs' reasoning capabilities in Industry 4.0 scenarios, revealing performance gaps despite some models nearing expert-level accuracy.

Authors:Selcuk Gurses, Aozhong Zhang, Yanxia Deng, Xun Dong, Xin Li, Naigang Wang, Penghang Yin, Zi Yang
Title: DiaBlo: Diagonal Blocks Are Sufficient For Finetuning
Abstract:
Finetuning is a critical step for adapting large language models (LLMs) to domain-specific downstream tasks. To mitigate the substantial computational and memory costs of full-model fine-tuning, Parameter-Efficient Finetuning (PEFT) methods have been proposed to update only a small subset of model parameters. However, performance gaps between PEFT approaches and full-model fine-tuning still exist. In this work, we present DiaBlo, a simple yet effective PEFT approach that updates only the diagonal blocks of selected model weight matrices. Unlike Low Rank Adaptation (LoRA) and its variants, DiaBlo eliminates the need for low rank matrix products, thereby avoiding the reliance on auxiliary initialization schemes or customized optimization strategies to improve convergence. This design leads to stable and robust convergence while maintaining comparable memory efficiency and training speed to LoRA. We conduct extensive experiments across a range of tasks, including commonsense reasoning, arithmetic reasoning, code generation, and safety alignment, to evaluate the effectiveness and efficiency of DiaBlo. Across these benchmarks, DiaBlo demonstrates strong and consistent performance while maintaining high memory efficiency and fast finetuning speed. Code is available at https://github.com/ziyangjoy/DiaBlo.
中文: DiaBlo是一种参数高效微调方法,仅更新权重矩阵的对角块,无需低秩近似或特殊初始化即可实现稳定收敛,并保持与LoRA相当的训练效率。
English: DiaBlo is a parameter-efficient fine-tuning method that updates only diagonal blocks of weight matrices, achieving stable convergence and comparable efficiency to LoRA without requiring low-rank approximations or special initialization.
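
A minimal sketch of the idea, assuming a square linear layer (in_features == out_features): freeze the pretrained weight and learn only a block-diagonal additive update, avoiding the low-rank factors and initialization schemes LoRA-style methods need. The module below is an interpretation, not the authors' code.

```python
# Minimal sketch of diagonal-block finetuning: W stays frozen, and only a
# block-diagonal update (zero-initialized, so training starts exactly at the
# pretrained behavior) is learned.
import torch
import torch.nn as nn

class DiagBlockLinear(nn.Module):
    def __init__(self, base: nn.Linear, block: int):
        super().__init__()
        d = base.out_features
        assert base.in_features == d and d % block == 0
        self.base = base.requires_grad_(False)   # frozen pretrained weight
        self.blocks = nn.Parameter(torch.zeros(d // block, block, block))
        self.block = block

    def forward(self, x):
        n, b = self.blocks.shape[0], self.block
        xb = x.view(*x.shape[:-1], n, b)          # split features into blocks
        delta = torch.einsum("...nb,nbc->...nc", xb, self.blocks)
        return self.base(x) + delta.reshape_as(x)

layer = DiagBlockLinear(nn.Linear(64, 64), block=8)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 512 trainable
```

Only `blocks` would be handed to the optimizer; the trainable count scales with d * block rather than d * d.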

Authors:Yinjie Wang, Ling Yang, Ye Tian, Ke Shen, Mengdi Wang
Title: Co-Evolving LLM Coder and Unit Tester via Reinforcement Learning
Abstract:
We propose CURE, a novel reinforcement learning framework with a dedicated reward design that co-evolves coding and unit test generation capabilities based on their interaction outcomes, without any ground-truth code as supervision. This approach enables flexible and scalable training and allows the unit tester to learn directly from the coder's mistakes. Our derived ReasonFlux-Coder-7B and 14B models improve code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% after optimization on Qwen2.5-Instruct models, outperforming similarly sized Qwen-Coder, DeepSeek-Coder, and Seed-Coder. They naturally extend to downstream tasks such as test-time scaling and agentic coding, achieving an 8.1% improvement over the base model. For the long-CoT model, our ReasonFlux-Coder-4B consistently outperforms Qwen3-4B while achieving 64.8% inference efficiency in unit test generation. Notably, we also find that our model can serve as an effective reward model for reinforcement learning on base models. Project: https://github.com/Gen-Verse/CURE
中文: CURE框架通过基于交互的奖励机制协同进化代码生成与单元测试能力,在Qwen2.5模型上实现代码生成准确率提升5.3%,Best-of-N准确率提升9.0%,并能有效扩展至下游任务。
English: CURE is a reinforcement learning framework that co-evolves coding and unit test generation through interaction-based rewards, improving code generation accuracy by 5.3% and Best-of-N accuracy by 9.0% while enabling effective downstream applications.

Authors:Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
Title: OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
Abstract:
Spatial reasoning is a key aspect of cognitive psychology and remains a bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks cover only the most elementary layer of spatial reasoning and are largely approaching saturation in the latest reasoning models. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through careful manual annotation, we construct over 8.4K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs exhibit significant limitations in comprehensive spatial reasoning. We also explore two strategies, PointGraph (explicit scene graph cues) and SpatialCoT (novel-view chain-of-thought), to bolster spatial reasoning.
中文: OmniSpatial是一个基于认知心理学的综合性空间推理基准,涵盖四大类别并揭示当前视觉语言模型存在显著不足,同时提出了两种改进策略。
English: OmniSpatial is a comprehensive benchmark for spatial reasoning in vision-language models, covering four major categories and revealing significant limitations in current models despite proposed enhancement strategies.

Authors:Li Zhang, Kevin D. Ashley
Title: Mitigating Manipulation and Enhancing Persuasion: A Reflective Multi-Agent Approach for Legal Argument Generation
Abstract:
Large Language Models (LLMs) are increasingly explored for legal argument generation, yet they pose significant risks of manipulation through hallucination and ungrounded persuasion, and often fail to utilize provided factual bases effectively or abstain when arguments are untenable. This paper introduces a novel reflective multi-agent method designed to address these challenges in the context of legally compliant persuasion. Our approach employs specialized agents, a Factor Analyst and an Argument Polisher, in an iterative refinement process to generate 3-ply legal arguments (plaintiff, defendant, rebuttal). We evaluate Reflective Multi-Agent against single-agent, enhanced-prompt single-agent, and non-reflective multi-agent baselines using four diverse LLMs (GPT-4o, GPT-4o-mini, Llama-4-Maverick-17b-128e, Llama-4-Scout-17b-16e) across three legal scenarios: "arguable", "mismatched", and "non-arguable". Results demonstrate Reflective Multi-Agent's significant superiority in successful abstention (preventing generation when arguments cannot be grounded), marked improvements in hallucination accuracy (reducing fabricated and misattributed factors), particularly in "non-arguable" scenarios, and enhanced factor utilization recall (improving the use of provided case facts). These findings suggest that structured reflection within a multi-agent framework offers a robust computable method for fostering ethical persuasion and mitigating manipulation in LLM-based legal argumentation systems, a critical step towards trustworthy AI in law. Project page: https://lizhang-aiandlaw.github.io/A-Reflective-Multi-Agent-Approach-for-Legal-Argument-Generation/

Authors:Yin Fang, Qiao Jin, Guangzhi Xiong, Bowen Jin, Xianrui Zhong, Siru Ouyang, Aidong Zhang, Jiawei Han, Zhiyong Lu
Title: Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning
Abstract:
Cell type annotation is a key task in analyzing the heterogeneity of single-cell RNA sequencing data. Although recent foundation models automate this process, they typically annotate cells independently, without considering batch-level cellular context or providing explanatory reasoning. In contrast, human experts often annotate distinct cell types for different cell clusters based on their domain knowledge. To mimic this workflow, we introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We find that off-the-shelf large language models (LLMs) struggle on CellPuzzles, with the best baseline (OpenAI's o1) achieving only 19.0% batch-level accuracy. To fill this gap, we propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards. Cell-o1 achieves state-of-the-art performance, outperforming o1 by over 73% and generalizing well across contexts. Further analysis of training dynamics and reasoning behaviors provides insights into batch-level annotation performance and emergent expert-like reasoning. Code and data are available at https://github.com/ncbi-nlp/cell-o1.
中文: 本研究提出了CellPuzzles基准任务,要求通过批次级推理在不同条件下分配唯一细胞类型,并开发了Cell-o1模型——一个通过监督微调和批次级奖励强化学习训练的70亿参数大语言模型,实现了最先进的性能表现。
English: The study introduces CellPuzzles, a benchmark task requiring batch-level reasoning to assign unique cell types across diverse conditions, and proposes Cell-o1, a 7B LLM that achieves state-of-the-art performance by leveraging supervised fine-tuning and reinforcement learning with batch-level rewards.

Authors:Chen Qian, Dongrui Liu, Haochen Wen, Zhen Bai, Yong Liu, Jing Shao
Title: Demystifying Reasoning Dynamics with Mutual Information: Thinking Tokens are Information Peaks in LLM Reasoning
Abstract:
Large reasoning models (LRMs) have demonstrated impressive capabilities in complex problem-solving, yet their internal reasoning mechanisms remain poorly understood. In this paper, we investigate the reasoning trajectories of LRMs from an information-theoretic perspective. By tracking how mutual information (MI) between intermediate representations and the correct answer evolves during LRM reasoning, we observe an interesting MI peaks phenomenon: the MI at specific generative steps exhibits a sudden and significant increase during the LRM's reasoning process. We theoretically analyze this phenomenon and show that as MI increases, the probability of the model's prediction error decreases. Furthermore, these MI peaks often correspond to tokens expressing reflection or transition, such as "Hmm", "Wait", and "Therefore", which we term thinking tokens. We then demonstrate that these thinking tokens are crucial for the LRM's reasoning performance, while other tokens have minimal impact. Building on these analyses, we propose two simple yet effective methods to improve the LRM's reasoning performance, by delicately leveraging these thinking tokens. Overall, our work provides novel insights into the reasoning mechanisms of LRMs and offers practical ways to improve their reasoning capabilities. The code is available at https://github.com/ChnQ/MI-Peaks.
中文: 研究表明大型推理模型在解题过程中会出现互信息峰值现象,尤其在"思考标记"(如"嗯"、"因此"等)处最为显著,这些标记对模型推理性能至关重要,可被巧妙利用来提升其推理能力。
English: This study reveals that large reasoning models exhibit sudden mutual information peaks during problem-solving, particularly at "thinking tokens" like "Hmm" or "Therefore," which are crucial for accurate predictions and can be leveraged to enhance reasoning performance.

Authors:Ekaterina Grishina, Mikhail Gorbunov, Maxim Rakhuba
Title: ProcrustesGPT: Compressing LLMs with Structured Matrices and Orthogonal Transformations
Abstract:
Large language models (LLMs) demonstrate impressive results in natural language processing tasks but require a significant amount of computational and memory resources. Structured matrix representations are a promising way for reducing the number of parameters of these models. However, it seems unrealistic to expect that weight matrices of pretrained models can be accurately represented by structured matrices without any fine-tuning. To overcome this issue, we utilize the fact that LLM output is invariant under certain orthogonal transformations of weight matrices. This insight can be leveraged to identify transformations that significantly improve the compressibility of weights within structured classes. The proposed approach is applicable to various types of structured matrices that support efficient projection operations. Code is available at https://github.com/GrishKate/ProcrustesGPT
中文: 大语言模型可通过利用保持输出不变的正交变换,采用结构化矩阵进行压缩,从而实现无需微调的高效参数削减。
English: Large language models can be compressed using structured matrices by leveraging orthogonal transformations that maintain output invariance, enabling efficient parameter reduction without fine-tuning.
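
The search over orthogonal transformations can be sketched with the classical orthogonal Procrustes solution: alternate between projecting $WQ$ onto the structured class and re-solving for the orthogonal $Q$ that best matches the projection. Low-rank projection below is an assumed stand-in for whatever structured class is used, and the real method applies paired transformations so the network output is preserved.

```python
# Sketch of compressibility-aware orthogonal search via alternating
# projection and orthogonal Procrustes (low-rank class as a stand-in).
import numpy as np

def low_rank_project(A, r):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def procrustes(W, S):
    # argmin over orthogonal Q of ||W Q - S||_F  =>  Q = U V^T from SVD(W^T S)
    U, _, Vt = np.linalg.svd(W.T @ S)
    return U @ Vt

rng = np.random.default_rng(3)
W, Q = rng.normal(size=(64, 64)), np.eye(64)
for _ in range(20):                     # alternate projection / Procrustes step
    S = low_rank_project(W @ Q, r=16)
    Q = procrustes(W, S)
err = np.linalg.norm(W @ Q - low_rank_project(W @ Q, 16)) / np.linalg.norm(W)
print(f"relative residual after rotation: {err:.3f}")
```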

Authors:Renyang Liu, Wenjie Feng, Tianwei Zhang, Wei Zhou, Xueqi Cheng, See-Kiong Ng
Title: Rethinking Machine Unlearning in Image Generation Models
Abstract:
With the surge and widespread application of image generation models, data privacy and content safety have become major concerns and attracted great attention from users, service providers, and policymakers. Machine unlearning (MU) is recognized as a cost-effective and promising means to address these challenges. Despite some advancements, image generation model unlearning (IGMU) still faces remarkable gaps in practice, e.g., unclear task discrimination and unlearning guidelines, lack of an effective evaluation framework, and unreliable evaluation metrics. These can hinder the understanding of unlearning mechanisms and the design of practical unlearning algorithms. We perform exhaustive assessments over existing state-of-the-art unlearning algorithms and evaluation standards, and discover several critical flaws and challenges in IGMU tasks. Driven by these limitations, we make several core contributions, to facilitate the comprehensive understanding, standardized categorization, and reliable evaluation of IGMU. Specifically, (1) We design CatIGMU, a novel hierarchical task categorization framework. It provides detailed implementation guidance for IGMU, assisting in the design of unlearning algorithms and the construction of testbeds. (2) We introduce EvalIGMU, a comprehensive evaluation framework. It includes reliable quantitative metrics across five critical aspects. (3) We construct DataIGM, a high-quality unlearning dataset, which can be used for extensive evaluations of IGMU, training content detectors for judgment, and benchmarking the state-of-the-art unlearning algorithms. With EvalIGMU and DataIGM, we discover that most existing IGMU algorithms cannot handle the unlearning well across different evaluation dimensions, especially for preservation and robustness. Code and models are available at https://github.com/ryliu68/IGMU.
中文: 本研究针对图像生成模型遗忘中的关键问题,提出了CatIGMU任务分类框架、EvalIGMU评估体系和DataIGM数据集,发现现有算法在保持性和鲁棒性方面存在明显不足。
English: The study addresses critical gaps in image generation model unlearning by introducing CatIGMU for task categorization, EvalIGMU for evaluation, and DataIGM for benchmarking, revealing that current algorithms struggle with preservation and robustness.

Authors:Zhi-Yuan Chen, Hao Wang, Xinyu Zhang, Enrui Hu, Yankai Lin
Title: Beyond the Surface: Measuring Self-Preference in LLM Judgments
Abstract:
Recent studies show that large language models (LLMs) exhibit self-preference bias when serving as judges, meaning they tend to favor their own responses over those generated by other models. Existing methods typically measure this bias by calculating the difference between the scores a judge model assigns to its own responses and those it assigns to responses from other models. However, this approach conflates self-preference bias with response quality, as higher-quality responses from the judge model may also lead to positive score differences, even in the absence of bias. To address this issue, we introduce gold judgments as proxies for the actual quality of responses and propose the DBG score, which measures self-preference bias as the difference between the scores assigned by the judge model to its own responses and the corresponding gold judgments. Since gold judgments reflect true response quality, the DBG score mitigates the confounding effect of response quality on bias measurement. Using the DBG score, we conduct comprehensive experiments to assess self-preference bias across LLMs of varying versions, sizes, and reasoning abilities. Additionally, we investigate two factors that influence and help alleviate self-preference bias: response text style and the post-training data of judge models. Finally, we explore potential underlying mechanisms of self-preference bias from an attention-based perspective. Our code and data are available at https://github.com/zhiyuanc2001/self-preference.
Chinese: 本研究提出DBG评分法,通过比较模型自评分与黄金标准判断来量化大语言模型的自偏好偏差,揭示了响应风格和训练数据等影响因素,并从注意力机制角度探讨了其潜在成因。
English: This study introduces the DBG score to measure self-preference bias in LLMs by comparing a model's self-assigned scores against gold-standard judgments, revealing how factors like response style and training data influence bias while exploring its attention-based mechanisms.
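
The metric itself is one line once gold judgments are available; the numbers below are invented.

```python
# Sketch of the DBG score: self-preference bias measured against gold
# judgments rather than against scores assigned to other models' responses.
def dbg_score(self_scores, gold_scores):
    """Mean gap between a judge's scores for its own responses and the gold
    judgments of those same responses; > 0 indicates self-preference."""
    assert len(self_scores) == len(gold_scores)
    return sum(s - g for s, g in zip(self_scores, gold_scores)) / len(self_scores)

print(dbg_score([8.5, 9.0, 7.0], [7.5, 8.0, 7.0]))  # 0.67: judge inflates its own answers
```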

Authors:Sohan Patnaik, Milan Aggarwal, Sumit Bhatia, Balaji Krishnamurthy
Title: Learning Together to Perform Better: Teaching Small-Scale LLMs to Collaborate via Preferential Rationale Tuning
Abstract:
LLMs such as GPT-4 have shown a remarkable ability to solve complex questions by generating step-by-step rationales. Prior works have utilized this capability to improve smaller and cheaper LMs (say, with 7B parameters). However, various practical constraints, such as copyright and legal issues, owing to lack of transparency in the pre-training data of large (often closed) models, prevent their use in commercial settings. Little focus has been given to improving the innate reasoning ability of smaller models without distilling information from larger LLMs. To address this, we propose COLLATE, a trainable framework that tunes a (small) LLM to generate, from a pool of diverse rationales, those outputs that selectively improve the downstream task. COLLATE enforces multiple instances of the same LLM to exhibit distinct behavior and employs them to generate rationales to obtain diverse outputs. The LLM is then tuned via preference optimization to choose the candidate rationale which maximizes the likelihood of the ground-truth answer. COLLATE outperforms several trainable and prompting baselines on 5 datasets across 3 domains: maths problem solving, natural language inference, and commonsense reasoning. We show the efficacy of COLLATE on LLMs from different model families across varying parameter scales (1B to 8B) and demonstrate the benefit of multiple rationale providers guided by the end task through ablations. Code is released here (https://github.com/Sohanpatnaik106/collate).
中文摘要:COLLATE框架通过训练小型语言模型从多样化输出中选择最优推理路径,在不依赖大模型的情况下显著提升了数学解题、自然语言推理和常识推理等多个领域的性能表现。
English Summary: The COLLATE framework enhances smaller language models' reasoning abilities by training them to select optimal rationales from diverse outputs, achieving superior performance across multiple domains without relying on larger models.

Authors:Maryam Berijanian, Kuldeep Singh, Amin Sehati
Title: Comparative Analysis of AI Agent Architectures for Entity Relationship Classification
Abstract:
Entity relationship classification remains a challenging task in information extraction, especially in scenarios with limited labeled data and complex relational structures. In this study, we conduct a comparative analysis of three distinct AI agent architectures designed to perform relation classification using large language models (LLMs). The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism, each leveraging different modes of reasoning and prompt adaptation. In particular, our dynamic example generation approach introduces real-time cooperative and adversarial prompting. We systematically compare their performance across multiple domains and model backends. Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting and approaches the performance of fine-tuned models. These findings offer practical guidance for the design of modular, generalizable LLM-based systems for structured relation extraction. The source codes and dataset are available at https://github.com/maryambrj/ALIEN.git.
中文: 本研究比较了三种基于大语言模型的关系分类智能体架构,发现多智能体协作方法持续优于标准少样本提示,并接近微调模型性能,为模块化关系提取系统提供了实用指导。
English: This study compares three AI agent architectures for entity relationship classification using large language models, finding that multi-agent coordination outperforms standard few-shot prompting and approaches fine-tuned model performance, offering practical guidance for modular LLM systems.

Authors:Fengjin Li, Jie Wang, Yadong Niu, Yongqing Wang, Meng Meng, Jian Luan, Zhiyong Wu
Title: StarVC: A Unified Auto-Regressive Framework for Joint Text and Speech Generation in Voice Conversion
Abstract:
Voice Conversion (VC) modifies speech to match a target speaker while preserving linguistic content. Traditional methods usually extract speaker information directly from speech while neglecting the explicit utilization of linguistic content. Since VC fundamentally involves disentangling speaker identity from linguistic content, leveraging structured semantic features could enhance conversion performance. However, previous attempts to incorporate semantic features into VC have shown limited effectiveness, motivating the integration of explicit text modeling. We propose StarVC, a unified autoregressive VC framework that first predicts text tokens before synthesizing acoustic features. The experiments demonstrate that StarVC outperforms conventional VC methods in preserving both linguistic content (i.e., WER and CER) and speaker characteristics (i.e., SECS and MOS). Audio demo can be found at: https://thuhcsi.github.io/StarVC/.

Authors:Herun Wan, Jiaying Wu, Minnan Luo, Zhi Zeng, Zhixiong Su
Title: Truth over Tricks: Measuring and Mitigating Shortcut Learning in Misinformation Detection
Abstract:
Misinformation detection models often rely on superficial cues (i.e., \emph{shortcuts}) that correlate with misinformation in training data but fail to generalize to the diverse and evolving nature of real-world misinformation. This issue is exacerbated by large language models (LLMs), which can easily generate convincing misinformation through simple prompts. We introduce TruthOverTricks, a unified evaluation paradigm for measuring shortcut learning in misinformation detection. TruthOverTricks categorizes shortcut behaviors into intrinsic shortcut induction and extrinsic shortcut injection, and evaluates seven representative detectors across 14 popular benchmarks, along with two new factual misinformation datasets, NQ-Misinfo and Streaming-Misinfo. Empirical results reveal that existing detectors suffer severe performance degradation when exposed to both naturally occurring and adversarially crafted shortcuts. To address this, we propose SMF, an LLM-augmented data augmentation framework that mitigates shortcut reliance through paraphrasing, factual summarization, and sentiment normalization. SMF consistently enhances robustness across 16 benchmarks, encouraging models to rely on deeper semantic understanding rather than shortcut cues. To promote the development of misinformation detectors, we have published the resources publicly at https://github.com/whr000001/TruthOverTricks.
中文:当前虚假信息检测模型常依赖表面线索,难以应对现实场景,尤其面对大语言模型生成的虚假信息时,但提出的SMF框架通过数据增强鼓励深层语义分析,有效提升了检测的鲁棒性。
English: Current misinformation detection models often depend on superficial shortcuts that fail in real-world scenarios, especially with LLM-generated misinformation, but the proposed SMF framework enhances robustness by using data augmentation to encourage deeper semantic analysis.

Authors:Michael Li, Nishant Subramani
Title: Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models
Abstract:
Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today's language models, we investigate how 25 models - from classical architectures (BERT, DeBERTa, GPT-2) to modern large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) - represent lexical identity and inflectional morphology across six typologically diverse languages. Using linear and nonlinear classifiers trained on hidden activations, we predict word lemmas and inflectional features layer by layer. We find that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout. Additional experiments probe the nature of these encodings: attention and residual analyses examine where within layers information can be recovered, steering vector experiments test what information can be functionally manipulated, and intrinsic dimensionality analyses explore how the representational structure evolves across layers. Remarkably, these encoding patterns emerge across all models we test, despite differences in architecture, size, and training regime (pretrained and instruction-tuned variants). This suggests that, even with substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties are important for next token prediction and are learned early during pretraining. Our code is available at https://github.com/ml5885/model_internal_sleuthing
中文: 研究表明,无论架构如何差异,Transformer语言模型都一致地在早期层线性编码词汇信息,在后期层非线性编码,同时保持屈折形态信息在各层中均匀可访问且线性可分。
English: This study reveals that transformer language models consistently encode lexical information linearly in early layers and nonlinearly in later layers, while maintaining inflectional morphology as uniformly accessible linear representations across all layers, regardless of architectural differences.
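
The probing setup is straightforward to sketch: fit a linear classifier per layer on hidden activations and track where a feature becomes linearly recoverable. The activations below are synthetic, with a signal that strengthens with depth purely for illustration; the paper probes real activations from 25 models.

```python
# Sketch of layer-wise linear probing for a morphological feature.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
n_layers, n_words, d = 6, 400, 32
labels = rng.integers(0, 2, n_words)             # e.g. singular vs. plural

for layer in range(n_layers):
    # Synthetic activations: the label signal grows with depth (toy setup).
    signal = labels[:, None] * rng.normal(size=(1, d)) * (layer + 1) / n_layers
    acts = rng.normal(size=(n_words, d)) + signal
    Xtr, Xte, ytr, yte = train_test_split(acts, labels, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    print(f"layer {layer}: probe accuracy {acc:.2f}")
```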

Authors:E Fan, Kang Hu, Zhuowen Wu, Jiangyang Ge, Jiawei Miao, Yuzhi Zhang, He Sun, Weizong Wang, Tianhan Zhang
Title: ChatCFD: An LLM-Driven Agent for End-to-End CFD Automation with Domain-Specific Structured Reasoning
Abstract:
Computational Fluid Dynamics (CFD) is essential for advancing scientific and engineering fields but is hindered by operational complexity, high expertise requirements, and limited accessibility. This paper introduces ChatCFD, an automated agent system for OpenFOAM simulations that processes multi-modal inputs (e.g., research papers, meshes) via an interactive interface, leveraging DeepSeek-R1 and DeepSeek-V3 large language models, a multi-agent architecture, and OpenFOAM knowledge. Its four-stage pipeline (Knowledge Base Construction, User Input Processing, Case File Generation, and Execution and Error Reflection) enables iterative trial-reflection-refinement for intricate setups, supporting diverse physical models and external meshes. Validation on 205 benchmark tutorial cases, 110 perturbed variants, and 2 literature-derived cases shows ChatCFD's 82.1 percent operational success rate on basic cases, outperforming MetaOpenFOAM (6.2 percent) and Foam-Agent (42.3 percent), and 60-80 percent on literature-derived complex cases. Turbulence model studies show a 40 percent success rate for common models versus 10 percent for rare ones like RNG k-epsilon. Physics coupling analyses reveal higher resource demands for multi-physics-coupled cases, while LLM bias toward simpler setups introduces persistent errors, such as dimensional inconsistency. Ablation studies highlight the efficacy of RAG-based modules and reflection mechanisms. By automating hypothesis testing and parameter exploration, ChatCFD accelerates scientific discovery in fluid mechanics and engineering, addressing LLM limitations through structured design. It also shows strong potential as a modular component in MCP-based agent networks for collaborative multi-agent systems, paving the way for scalable AI-driven CFD innovation. The code for ChatCFD is available at https://github.com/ConMoo/ChatCFD.
Chinese: ChatCFD是一种基于大语言模型的自动化代理系统,通过多阶段流程简化OpenFOAM仿真,在各类流体力学案例中取得较高成功率,其模块化设计有效克服了大模型的局限性。
English: ChatCFD is an automated agent system that simplifies OpenFOAM simulations using large language models and a multi-stage pipeline, achieving high success rates in diverse fluid dynamics cases while addressing LLM limitations through structured design.

Authors:Christopher Lee Lübbers
Title: Enhancing Paraphrase Type Generation: The Impact of DPO and RLHF Evaluated with Human-Ranked Data
Abstract:
Paraphrasing re-expresses meaning to enhance applications like text simplification, machine translation, and question-answering. Specific paraphrase types facilitate accurate semantic analysis and robust language models. However, existing paraphrase-type generation methods often misalign with human preferences due to reliance on automated metrics and limited human-annotated training data, obscuring crucial aspects of semantic fidelity and linguistic transformations. This study addresses this gap by leveraging a human-ranked paraphrase-type dataset and integrating Direct Preference Optimization (DPO) to align model outputs directly with human judgments. DPO-based training increases paraphrase-type generation accuracy by 3 percentage points over a supervised baseline and raises human preference ratings by 7 percentage points. A newly created human-annotated dataset supports more rigorous future evaluations. Additionally, a paraphrase-type detection model achieves F1 scores of 0.91 for addition/deletion, 0.78 for same polarity substitution, and 0.70 for punctuation changes. These findings demonstrate that preference data and DPO training produce more reliable, semantically accurate paraphrases, enabling downstream applications such as improved summarization and more robust question-answering. The PTD model surpasses automated metrics and provides a more reliable framework for evaluating paraphrase quality, advancing paraphrase-type research toward richer, user-aligned language generation and establishing a stronger foundation for future evaluations grounded in human-centric criteria.
中文: 本研究通过采用人工排序数据和直接偏好优化(DPO)改进了复述生成,提高了准确率和人工偏好评分,并引入了检测模型和数据集以改善评估及下游应用。
English: This study improves paraphrase generation by using human-ranked data and Direct Preference Optimization (DPO), resulting in higher accuracy and human preference ratings, and introduces a detection model and dataset for better evaluation and downstream applications.
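For readers unfamiliar with DPO, the objective that aligns the model with human rankings can be stated directly on per-sequence log-probabilities. The beta value and inputs below are illustrative, not the paper's training configuration:

```python
# Standard DPO loss on per-sequence log-probs: -log(sigmoid(beta * margin)),
# where the margin compares policy-vs-reference gains on chosen vs. rejected.
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
             beta=0.1):
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Example: the policy prefers the human-ranked paraphrase more strongly than
# the reference model does, so the loss is small.
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-15.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-13.5))
```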

Authors:Jennifer Chen, Aidar Myrzakhan, Yaxin Luo, Hassaan Muhammad Khan, Sondos Mahmoud Bsharat, Zhiqiang Shen
Title: DRAG: Distilling RAG for SLMs from LLMs to Transfer Knowledge and Mitigate Hallucination via Evidence and Graph-based Distillation
Abstract:
Retrieval-Augmented Generation (RAG) methods have proven highly effective for tasks requiring factual consistency and robust knowledge retrieval. However, large-scale RAG systems consume significant computational resources and are prone to generating hallucinated content. In this work, we introduce $\texttt{DRAG}$, a novel framework for distilling RAG knowledge from large-scale Language Models (LLMs) into small LMs (SLMs). Our approach leverages evidence- and knowledge graph-based distillation, ensuring that the distilled model retains critical factual knowledge while significantly reducing model size and computational cost. By aligning the smaller model's predictions with a structured knowledge graph and ranked evidence, $\texttt{DRAG}$ effectively mitigates hallucinations and improves factual accuracy. We further present a case demonstrating how our framework mitigates user privacy risks and introduce a corresponding benchmark. Experimental evaluations on multiple benchmarks demonstrate that our method outperforms the prior competitive RAG methods like MiniRAG for SLMs by up to 27.7% using the same models, preserving high-level efficiency and reliability. With $\texttt{DRAG}$, we provide a practical and resource-efficient roadmap to deploying enhanced retrieval and generation capabilities in small-sized LLMs.
中文: DRAG框架通过证据和知识图谱将大型语言模型的知识提炼到小型模型中,在显著降低计算成本的同时,有效提升事实准确性并减少幻觉生成。
English: The DRAG framework effectively distills knowledge from large to small language models using evidence and knowledge graphs, significantly reducing computational costs while enhancing factual accuracy and mitigating hallucinations.

Authors:Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh, Zhoujun Cheng, Zhengzhong Liu, Eric Xing, John Thickstun, Arash Vahdat
Title: Esoteric Language Models
Abstract:
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation. Among this family of models, Masked Diffusion Models (MDMs) achieve the strongest performance but still underperform AR models in perplexity and lack key inference-time efficiency features--most notably, KV caching. In this work, we introduce Eso-LMs, a new family of models that fuses AR and MDM paradigms, enabling smooth interpolation between their perplexities while overcoming their respective limitations. Eso-LMs set a new state of the art on standard language modeling benchmarks. Crucially, we are the **first to introduce KV caching for MDMs** while preserving parallel generation, significantly improving inference efficiency. Combined with an optimized sampling schedule, our method achieves up to **65x** faster inference than standard MDMs and **4x** faster inference than prior semi-autoregressive approaches. We provide the code and model checkpoints on the project page: [http://s-sahoo.github.io/Eso-LMs](http://s-sahoo.github.io/Eso-LMs)

Authors:Chi-Jane Chen, Yuhang Chen, Sukwon Yun, Natalie Stanley, Tianlong Chen
Title: Spatial Coordinates as a Cell Language: A Multi-Sentence Framework for Imaging Mass Cytometry Analysis
Abstract:
Imaging mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry's analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) Integration of spatial information: they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) Treating each cell independently: they overlook cell-cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates single-cell expression and spatial information into natural language using a multi-sentence approach. Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi-sentence representations enable LLMs to learn cellular interactions in both expression and spatial contexts. Equipped with multi-task learning, Spatial2Sentence outperforms existing single-cell LLMs on preprocessed IMC datasets, improving cell-type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability. The source code can be found here: https://github.com/UNITES-Lab/Spatial2Sentence.
中文: Spatial2Sentence框架通过将空间和表达数据转化为多句子表示,克服了现有单细胞大语言模型的局限性,在糖尿病数据集上显著提升了细胞类型分类和临床预测的准确性。
English: The proposed Spatial2Sentence framework overcomes limitations in existing single-cell large language models by integrating spatial and expression data into multi-sentence representations, achieving significant improvements in cell-type classification and clinical prediction on diabetes datasets.
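The pairing rule at the heart of the framework can be sketched in a few lines: spatially adjacent, expressionally similar cells become positive pairs, while distant, dissimilar cells become negatives. The thresholds and synthetic data below are assumptions for illustration:

```python
# Sketch of positive/negative pair construction from distance and
# expression-similarity matrices (toy data, illustrative thresholds).
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_markers = 50, 8
expr = rng.normal(size=(n_cells, n_markers))   # stand-in protein expression
xy = rng.uniform(0, 100, size=(n_cells, 2))    # stand-in spatial coordinates

dist = np.linalg.norm(xy[:, None] - xy[None, :], axis=-1)  # pairwise distance
unit = expr / np.linalg.norm(expr, axis=1, keepdims=True)
sim = unit @ unit.T                                        # cosine similarity

positives, negatives = [], []
for i in range(n_cells):
    for j in range(i + 1, n_cells):
        if dist[i, j] < 25 and sim[i, j] > 0.3:    # close and similar
            positives.append((i, j))
        elif dist[i, j] > 60 and sim[i, j] < -0.3: # far and dissimilar
            negatives.append((i, j))
print(len(positives), "positive pairs,", len(negatives), "negative pairs")
```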

Authors:Yijin Guo, Kaiyuan Ji, Xiaorong Zhu, Junying Wang, Farong Wen, Chunyi Li, Zicheng Zhang, Guangtao Zhai
Title: Human-Centric Evaluation for Foundation Models
Abstract:
Currently, nearly all evaluations of foundation models focus on objective metrics, emphasizing quiz performance to define model capabilities. While this model-centric approach enables rapid performance assessment, it fails to reflect authentic human experiences. To address this gap, we propose a Human-Centric subjective Evaluation (HCE) framework, focusing on three core dimensions: problem-solving ability, information quality, and interaction experience. Through experiments involving Deepseek R1, OpenAI o3 mini, Grok 3, and Gemini 2.5, we conduct over 540 participant-driven evaluations, where humans and models collaborate on open-ended research tasks, yielding a comprehensive subjective dataset. This dataset captures diverse user feedback across multiple disciplines, revealing distinct model strengths and adaptability. Our findings highlight Grok 3's superior performance, followed by Deepseek R1 and Gemini 2.5, with OpenAI o3 mini lagging behind. By offering a novel framework and a rich dataset, this study not only enhances subjective evaluation methodologies but also lays the foundation for standardized, automated assessments, advancing LLM development for research and practical scenarios. Our dataset link is https://github.com/yijinguo/Human-Centric-Evaluation.
中文摘要:本研究提出以人为中心的评估框架,通过问题解决能力和交互体验等主观维度弥补客观评估的不足,在大量用户参与的实验中发现Grok 3表现最佳,为LLM开发提供了新的评估方法和丰富数据集。
English Summary: This study introduces a Human-Centric Evaluation framework to address the limitations of objective metrics by assessing foundation models through subjective dimensions like problem-solving and interaction quality, revealing Grok 3's top performance among tested models through extensive user-driven experiments.

Authors:Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
Title: Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
Abstract:
High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
中文: 本立场文件提出DataRubrics框架,通过基于量规的指标和LLM驱动的评估,系统性地评判数据集质量,以解决数据研究中原创性、透明度和可复现性不足的问题。
English: This position paper proposes DataRubrics, a structured framework using rubric-based metrics and LLM-powered evaluation to systematically assess dataset quality, addressing gaps in originality, transparency, and reproducibility in data-centric research.

Authors:Anya Sims, Thom Foster, Klara Kaleb, Tuan-Duy H. Nguyen, Joseph Lee, Jakob N. Foerster, Yee Whye Teh, Cong Lu
Title: StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Abstract:
Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like "How many 'r's in 'strawberry'?". A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline; and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: https://github.com/anyasims/stochastok.
Chinese Summary: 本文提出StochasTok随机分词方法,通过在训练中随机分割词汇来增强大语言模型的子词理解能力,有效提升了字符计数和数学运算等任务的性能,且无需昂贵的重新预训练。
English Summary: This paper introduces StochasTok, a stochastic tokenization method that enhances large language models' subword-level understanding by randomly splitting tokens during training, improving performance on tasks like character counting and math problems without requiring costly retraining.
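The core idea is simple enough to sketch: during training, occasionally replace a token with two vocabulary items that concatenate back to it, so the model sees word-internal structure. This toy re-implementation uses a hand-built vocabulary; the actual method operates on a trained tokenizer's vocabulary:

```python
# Toy StochasTok-style splitting: with probability p_split, replace a token
# by a valid two-piece decomposition drawn from the vocabulary.
import random

vocab = {"straw", "berry", "strawberry", "st", "raw", "b", "erry"}

def stochastok(tokens, p_split=0.5, rng=random.Random(0)):
    out = []
    for tok in tokens:
        splits = [(tok[:i], tok[i:]) for i in range(1, len(tok))
                  if tok[:i] in vocab and tok[i:] in vocab]
        if splits and rng.random() < p_split:
            out.extend(rng.choice(splits))  # expose internal structure
        else:
            out.append(tok)
    return out

# ['straw', 'berry'] or ['strawberry'], depending on the random draw.
print(stochastok(["strawberry"]))
```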

Authors:Sunkyung Lee, Minjin Choi, Eunseong Choi, Hye-young Kim, Jongwuk Lee
Title: GRAM: Generative Recommendation via Semantic-aware Multi-granular Late Fusion
Abstract:
Generative recommendation is an emerging paradigm that leverages the extensive knowledge of large language models by formulating recommendations into a text-to-text generation task. However, existing studies face two key limitations in (i) incorporating implicit item relationships and (ii) utilizing rich yet lengthy item information. To address these challenges, we propose a Generative Recommender via semantic-Aware Multi-granular late fusion (GRAM), introducing two synergistic innovations. First, we design semantic-to-lexical translation to encode implicit hierarchical and collaborative item relationships into the vocabulary space of LLMs. Second, we present multi-granular late fusion to integrate rich semantics efficiently with minimal information loss. It employs separate encoders for multi-granular prompts, delaying the fusion until the decoding stage. Experiments on four benchmark datasets show that GRAM outperforms eight state-of-the-art generative recommendation models, achieving significant improvements of 11.5-16.0% in Recall@5 and 5.3-13.6% in NDCG@5. The source code is available at https://github.com/skleee/GRAM.
中文:GRAM模型通过将隐含的项目关系编码至语言模型词汇空间,并采用多粒度延迟融合高效整合丰富语义,在生成式推荐任务中相比现有方法实现了显著性能提升。
English: The proposed GRAM model enhances generative recommendation by encoding implicit item relationships into language model vocabulary and employing multi-granular late fusion to efficiently integrate rich item semantics, achieving significant performance improvements over existing methods.

Authors:Zixiao Zhu, Kezhi Mao
Title: Domain Lexical Knowledge-based Word Embedding Learning for Text Classification under Small Data
Abstract:
Pre-trained language models such as BERT have proven powerful in many natural language processing tasks. However, in some text classification applications such as emotion recognition and sentiment analysis, BERT may not lead to satisfactory performance. This often happens in applications where keywords play critical roles in the prediction of class labels. Our investigation found that the root cause of the problem is that the context-based BERT embedding of the keywords may not be discriminative enough to produce discriminative text representation for classification. Motivated by this finding, we develop a method to enhance word embeddings using domain-specific lexical knowledge. The knowledge-based embedding enhancement model projects the BERT embedding into a new space where within-class similarity and between-class difference are maximized. To implement the knowledge-based word embedding enhancement model, we also develop a knowledge acquisition algorithm for automatically collecting lexical knowledge from online open sources. Experiment results on three classification tasks, including sentiment analysis, emotion recognition and question answering, have shown the effectiveness of our proposed word embedding enhancing model. The code and datasets are available at https://github.com/MidiyaZhu/KVWEFFER.
Chinese Summary: 本研究提出了一种利用领域特定词汇知识增强BERT词嵌入的方法,通过最大化类内相似性和类间差异,有效提升了情感分析等分类任务的性能。
English Summary: The study introduces a method to enhance BERT word embeddings using domain-specific lexical knowledge, which improves classification performance in tasks like sentiment analysis by maximizing within-class similarity and between-class differences.
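The projection objective, maximizing within-class similarity and between-class difference, is closely related to Fisher's linear discriminant, so LDA can serve as a minimal stand-in for the idea (synthetic embeddings below, not the paper's knowledge-based model):

```python
# LDA as a stand-in for a projection that tightens classes and widens
# margins between them, applied to stand-in "BERT" embeddings.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_per_class, dim = 100, 768
X, y = [], []
for label, shift in enumerate([-1.0, 0.0, 1.0]):   # three emotion classes
    X.append(rng.normal(loc=shift * 0.1, size=(n_per_class, dim)))
    y += [label] * n_per_class
X = np.vstack(X)

lda = LinearDiscriminantAnalysis(n_components=2)
Z = lda.fit_transform(X, y)   # new space: within-class tight, between-class wide
print(Z.shape)                # (300, 2)
```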

Authors:Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Liang Lin, Cewu Lu, Xiaodan Liang
Title: EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation
Abstract:
Building Vision-Language Navigation (VLN) agents which can navigate following natural language instructions is a long-standing goal in human-robot interaction applications. Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs' reasoning ability for improving navigation, and simultaneously mitigate the domain gap between LLMs' training corpus and the VLN task. However, these approaches primarily adopt direct input-output mapping paradigms, making the mapping difficult to learn and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, but the complexity of the navigation task makes perfect CoT labels unavailable and may lead to overfitting under pure CoT supervised fine-tuning. In this paper, we propose a novel sElf-improving embodied reasoning framework for boosting LLM-based vision-language Navigation, dubbed EvolveNav. Our EvolveNav consists of two stages: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with formalized CoT labels to both activate the model's navigational reasoning capabilities and increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also introduced to encourage learning correct reasoning patterns by contrasting with wrong ones. Experimental results on the popular VLN benchmarks demonstrate the superiority of EvolveNav over previous LLM-based VLN approaches. Code is available at https://github.com/expectorlin/EvolveNav.
中文: 本文提出EvolveNav自演进框架,通过链式思维监督微调和自反思后训练两阶段方法,利用大语言模型的推理能力提升视觉语言导航任务的准确性与可解释性。
English: This paper introduces EvolveNav, a self-improving framework that enhances vision-language navigation by training LLMs with formalized chain-of-thought reasoning and iterative self-reflective post-training to boost both accuracy and interpretability.

Authors:Wenhao Liu, Zhenyi Lu, Xinyu Hu, Jierui Zhang, Dailin Li, Jiacheng Cen, Huilin Cao, Haiteng Wang, Yuhan Li, Kun Xie, Dandan Li, Pei Zhang, Chengbo Zhang, Yuxiang Ren, Xiaohong Huang, Yan Ma
Title: STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework
Abstract:
High-quality math datasets are crucial for advancing the reasoning abilities of large language models (LLMs). However, existing datasets often suffer from three key issues: outdated and insufficient challenging content, neglecting human-like reasoning, and limited reliability due to single-LLM generation. To address these, we introduce STORM-BORN, an ultra-challenging dataset of mathematical derivations sourced from cutting-edge academic papers, which includes dense human-like approximations and heuristic cues. To ensure reliability and quality, we propose a novel human-in-the-loop, multi-agent data generation framework, integrating reasoning-dense filters, multi-agent collaboration, and human mathematicians' evaluations. We curated a set of 2,000 synthetic samples and deliberately selected the 100 most difficult problems. Even the most advanced models, such as GPT-o1, solved fewer than 5% of them. Fine-tuning on STORM-BORN boosts accuracy by 7.84% (LLaMA3-8B) and 9.12% (Qwen2.5-7B). As AI approaches mathematician-level reasoning, STORM-BORN provides both a high-difficulty benchmark and a human-like reasoning training resource. Our code and dataset are publicly available at https://github.com/lwhere/STORM-BORN.
中文: STORM-BORN数据集通过从学术论文中提取极具挑战性的数学推导问题,结合人类式推理线索和多智能体生成框架,解决了现有数据集内容陈旧和可靠性不足的问题,即使最先进模型也仅能解决不足5%的题目,但使用该数据集微调可显著提升模型性能。
English: The STORM-BORN dataset addresses limitations in existing math datasets by providing ultra-challenging problems derived from academic papers with human-like reasoning cues, and its multi-agent generation framework ensures high quality, significantly boosting model performance despite low solve rates by advanced models.

Authors:Ping Wu, Guobin Shen, Dongcheng Zhao, Yuwei Wang, Yiting Dong, Yu Shi, Enmeng Lu, Feifei Zhao, Yi Zeng
Title: CVC: A Large-Scale Chinese Value Rule Corpus for Value Alignment of Large Language Models
Abstract:
Ensuring that Large Language Models (LLMs) align with mainstream human values and ethical norms is crucial for the safe and sustainable development of AI. Current value evaluation and alignment are constrained by Western cultural bias and incomplete domestic frameworks reliant on non-native rules; furthermore, the lack of scalable, rule-driven scenario generation methods makes evaluations costly and inadequate across diverse cultural contexts. To address these challenges, we propose a hierarchical value framework grounded in core Chinese values, encompassing three main dimensions, 12 core values, and 50 derived values. Based on this framework, we construct a large-scale Chinese Values Corpus (CVC) containing over 250,000 value rules enhanced and expanded through human annotation. Experimental results show that CVC-guided scenarios outperform direct generation ones in value boundaries and content diversity. In the evaluation across six sensitive themes (e.g., surrogacy, suicide), seven mainstream LLMs preferred CVC-generated options in over 70.5% of cases, while five Chinese human annotators showed an 87.5% alignment with CVC, confirming its universality, cultural relevance, and strong alignment with Chinese values. Additionally, we construct 400,000 rule-based moral dilemma scenarios that objectively capture nuanced distinctions in conflicting value prioritization across 17 LLMs. Our work establishes a culturally-adaptive benchmarking framework for comprehensive value evaluation and alignment, representing Chinese characteristics. All data are available at https://huggingface.co/datasets/Beijing-AISI/CVC, and the code is available at https://github.com/Beijing-AISI/CVC.
中文摘要:本研究提出基于中国核心价值观的分层价值框架,构建大规模中文价值观语料库(CVC)以解决大语言模型价值对齐中的文化偏见问题,实验证明CVC在价值边界界定和文化适应性方面表现优异,并为价值评估提供可扩展的基准场景。
English Summary: This study introduces a hierarchical Chinese values framework and constructs a large-scale Chinese Values Corpus (CVC) to address cultural biases in LLM alignment, demonstrating through experiments that CVC effectively enhances value boundary definition and cultural relevance while providing scalable evaluation scenarios.

Authors:Long Yao, Wenzhong Yang, Yabo Yin, Fuyuan Wei, Hongzhen Lv, Jiaren Peng, Liejun Wang, Xiaoming Tao
Title: Argument-Centric Causal Intervention Method for Mitigating Bias in Cross-Document Event Coreference Resolution
Abstract:
Cross-document Event Coreference Resolution (CD-ECR) is a fundamental task in natural language processing (NLP) that seeks to determine whether event mentions across multiple documents refer to the same real-world occurrence. However, current CD-ECR approaches predominantly rely on trigger features within input mention pairs, which induce spurious correlations between surface-level lexical features and coreference relationships, impairing the overall performance of the models. To address this issue, we propose a novel cross-document event coreference resolution method based on Argument-Centric Causal Intervention (ACCI). Specifically, we construct a structural causal graph to uncover confounding dependencies between lexical triggers and coreference labels, and introduce backdoor-adjusted interventions to isolate the true causal effect of argument semantics. To further mitigate spurious correlations, ACCI integrates a counterfactual reasoning module that quantifies the causal influence of trigger word perturbations, and an argument-aware enhancement module to promote greater sensitivity to semantically grounded information. In contrast to prior methods that depend on costly data augmentation or heuristic-based filtering, ACCI enables effective debiasing in a unified end-to-end framework without altering the underlying training procedure. Extensive experiments demonstrate that ACCI achieves CoNLL F1 of 88.4% on ECB+ and 85.2% on GVC, achieving state-of-the-art performance. The implementation and materials are available at https://github.com/era211/ACCI.
中文: 本文提出了一种新颖的跨文档事件共指消解方法ACCI,通过基于论据的因果干预消除词汇触发词的伪相关,在基准测试中实现了最先进的性能。
English: This paper introduces a novel cross-document event coreference resolution method called ACCI, which uses argument-centric causal intervention to eliminate spurious correlations from lexical triggers and achieves state-of-the-art performance on benchmark datasets.
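For reference, the backdoor adjustment that underlies such interventions has a compact form. Writing $X$ for the argument semantics of a mention pair, $Y$ for the coreference label, and $Z$ for the confounding trigger features (symbol choices are ours, not the paper's notation), the intervened distribution averages the confounder out:

$P(Y \mid \mathrm{do}(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z)$

Estimating this quantity, rather than the raw conditional $P(Y \mid X)$, is what removes the spurious trigger-label shortcut.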

Authors:Yulei Qin, Gang Li, Zongyi Li, Zihan Xu, Yuchen Shi, Zhekai Lin, Xiao Cui, Ke Li, Xing Sun
Title: Incentivizing Reasoning for Advanced Instruction-Following of Large Language Models
Abstract:
Existing large language models (LLMs) face challenges in following complex instructions, especially when multiple constraints are present and organized in paralleling, chaining, and branching structures. One intuitive solution, namely chain-of-thought (CoT), is expected to universally improve capabilities of LLMs. However, we find that the vanilla CoT exerts a negative impact on performance due to its superficial reasoning pattern of simply paraphrasing the instructions. It fails to peel back the compositions of constraints for identifying their relationship across hierarchies of types and dimensions. To this end, we propose RAIF, a systematic method to boost LLMs in dealing with complex instructions via incentivizing reasoning for test-time compute scaling. First, we stem from the decomposition of complex instructions under existing taxonomies and propose a reproducible data acquisition method. Second, we exploit reinforcement learning (RL) with verifiable rule-centric reward signals to cultivate reasoning specifically for instruction following. We address the shallow, non-essential nature of reasoning under complex instructions via sample-wise contrast for superior CoT enforcement. We also exploit behavior cloning of experts to facilitate steady distribution shift from fast-thinking LLMs to skillful reasoners. Extensive evaluations on seven comprehensive benchmarks confirm the validity of the proposed method, where a 1.5B LLM achieves 11.74% gains with performance comparable to an 8B LLM. Evaluation on OOD constraints also confirms the generalizability of our RAIF. Code and data are available at https://github.com/yuleiqin/RAIF. Keywords: reinforcement learning with verifiable rewards (RLVR), instruction following, complex instructions
中文: 现有大语言模型在处理包含多重约束的复杂指令时存在困难,而提出的RAIF方法通过基于可验证奖励的强化学习来增强推理能力,显著提升了小规模模型的性能表现。
English: Current large language models struggle with complex instructions containing multiple constraints, but the proposed RAIF method uses reinforcement learning with verifiable rewards to enhance reasoning capabilities, significantly improving performance even in smaller models.
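A rule-centric verifiable reward of the kind such RL training needs can be sketched as a set of checkable predicates whose pass rate is the reward. The three constraints below are illustrative stand-ins, not RAIF's taxonomy:

```python
# Sketch of a verifiable, rule-centric reward: each atomic constraint is a
# predicate on the response; the reward is the fraction satisfied.
def reward(response: str) -> float:
    constraints = [
        lambda r: len(r.split()) <= 50,          # length constraint
        lambda r: "in summary" in r.lower(),     # required phrase
        lambda r: r.strip().endswith("."),       # formatting constraint
    ]
    passed = sum(check(response) for check in constraints)
    return passed / len(constraints)

print(reward("In summary, the model satisfies two of the three rules"))  # 0.67
```

Because every term is mechanically checkable, the reward needs no learned judge, which is what makes it "verifiable".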

Authors:Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun
Title: AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning
Abstract:
The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability. However, practical deployment of such agents remains constrained by several key challenges. Existing training data is often noisy and lacks semantic diversity, which hinders the learning of precise grounding and planning. Models trained purely by imitation tend to overfit to seen interface patterns and fail to generalize in unfamiliar scenarios. Moreover, most prior work focuses on English interfaces while overlooking the growing diversity of non-English applications, such as those in the Chinese mobile ecosystem. In this work, we present AgentCPM-GUI, an 8B-parameter GUI agent built for robust and efficient on-device GUI interaction. Our training pipeline includes grounding-aware pre-training to enhance perception, supervised fine-tuning on high-quality Chinese and English trajectories to imitate human-like actions, and reinforcement fine-tuning with GRPO to improve reasoning capability. We also introduce a compact action space that reduces output length and supports low-latency execution on mobile devices. AgentCPM-GUI achieves state-of-the-art performance on five public benchmarks and a new Chinese GUI benchmark called CAGUI, reaching $96.9\%$ Type-Match and $91.3\%$ Exact-Match. To facilitate reproducibility and further research, we publicly release all code, model checkpoints, and evaluation data.
中文: 大型语言模型代理在自动化图形界面任务方面展现出潜力,但面临训练数据噪声大、泛化能力差等挑战;AgentCPM-GUI通过增强训练流程和精简动作空间,在多个基准测试中取得了领先性能。
English: Large language model agents show promise for automating GUI tasks, yet face challenges like noisy training data and limited generalization, which AgentCPM-GUI addresses through a robust training pipeline and compact action space to achieve top performance on benchmarks.

Authors:Zhiyang Qi, Takumasa Kaneko, Keiko Takamizo, Mariko Ukiyo, Michimasa Inaba
Title: KokoroChat: A Japanese Psychological Counseling Dialogue Dataset Collected via Role-Playing by Trained Counselors
Abstract:
Generating psychological counseling responses with language models relies heavily on high-quality datasets. Crowdsourced data collection methods require strict worker training, and data from real-world counseling environments may raise privacy and ethical concerns. While recent studies have explored using large language models (LLMs) to augment psychological counseling dialogue datasets, the resulting data often suffers from limited diversity and authenticity. To address these limitations, this study adopts a role-playing approach where trained counselors simulate counselor-client interactions, ensuring high-quality dialogues while mitigating privacy risks. Using this method, we construct KokoroChat, a Japanese psychological counseling dialogue dataset comprising 6,589 long-form dialogues, each accompanied by comprehensive client feedback. Experimental results demonstrate that fine-tuning open-source LLMs with KokoroChat improves both the quality of generated counseling responses and the automatic evaluation of counseling dialogues. The KokoroChat dataset is available at https://github.com/UEC-InabaLab/KokoroChat.
中文: 本研究通过专业咨询师角色扮演构建了KokoroChat日语心理辅导数据集,解决了现有大模型生成数据在多样性和真实性上的不足,实验表明使用该数据集微调模型能显著提升辅导回复质量和对话评估效果。
English: This study introduces KokoroChat, a Japanese psychological counseling dataset created through role-playing by trained counselors to overcome limitations in diversity and authenticity of existing LLM-generated data, enhancing both response quality and dialogue evaluation when fine-tuning models.

Authors:Xinyu Zhu, Mengzhou Xia, Zhepei Wei, Wei-Lin Chen, Danqi Chen, Yu Meng
Title: The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Abstract:
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for training language models (LMs) on reasoning tasks that elicit emergent long chains of thought (CoTs). Unlike supervised learning, it updates the model using both correct and incorrect samples via policy gradients. To better understand its mechanism, we decompose the learning signal into reinforcing correct responses and penalizing incorrect ones, referred to as Positive and Negative Sample Reinforcement (PSR and NSR), respectively. We train Qwen2.5-Math-7B and Qwen3-4B on a mathematical reasoning dataset and uncover a surprising result: training with only negative samples -- without reinforcing correct responses -- can be highly effective: it consistently improves performance over the base model across the entire Pass@$k$ spectrum ($k$ up to $256$), often matching or surpassing PPO and GRPO. In contrast, reinforcing only correct responses improves Pass@$1$ but degrades performance at higher $k$, due to reduced diversity. These inference-scaling trends highlight that solely penalizing incorrect responses may contribute more to performance than previously recognized. Through gradient analysis, we show that NSR works by suppressing incorrect generations and redistributing probability mass toward other plausible candidates, guided by the model's prior beliefs. It refines the model's existing knowledge rather than introducing entirely new behaviors. Building on this insight, we propose a simple variant of the RL objective that upweights NSR, and show that it consistently improves overall Pass@$k$ performance on MATH, AIME 2025, and AMC23. Our code is available at https://github.com/TianHongZXY/RLVR-Decomposed.
中文: 带有可验证奖励的强化学习(RLVR)通过惩罚错误答案来有效训练语言模型在推理任务上的表现,这种方法能抑制错误生成并重新分配概率给其他合理选项,在多项评估指标上常达到或超越传统方法的效果。
English: Reinforcement learning with verifiable rewards (RLVR) effectively trains language models on reasoning tasks by penalizing incorrect responses, which suppresses wrong answers and redistributes probability to other plausible options, often matching or surpassing traditional methods while improving performance across various evaluation metrics.
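The PSR/NSR decomposition is easy to state in code: the policy-gradient loss splits into a term that raises the log-probability of correct samples and one that lowers it for incorrect samples, with the paper's variant upweighting the latter. The lambda values and inputs below are illustrative:

```python
# Decomposed policy-gradient loss: PSR reinforces correct samples (+1
# advantage), NSR penalizes incorrect ones (-1 advantage).
import numpy as np

def decomposed_pg_loss(logprobs, correct, lam_psr=1.0, lam_nsr=1.0):
    """logprobs: per-sample sequence log-probs; correct: boolean mask."""
    logprobs = np.asarray(logprobs)
    correct = np.asarray(correct)
    psr = -(logprobs[correct]).sum()   # minimizing raises correct log-probs
    nsr = (logprobs[~correct]).sum()   # minimizing lowers incorrect log-probs
    return lam_psr * psr + lam_nsr * nsr

# NSR-only training corresponds to lam_psr = 0.
print(decomposed_pg_loss([-3.0, -5.0, -2.0], [True, False, True], lam_psr=0.0))
```

Suppressing an incorrect continuation implicitly redistributes probability mass to the model's other candidates, which matches the paper's account of why NSR alone can work so well.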

Authors:Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Min Zhang, Wen Zhang, Huajun Chen
Title: Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation
Abstract:
Multi-modal large language models (MLLMs) incorporate heterogeneous modalities into LLMs, enabling a comprehensive understanding of diverse scenarios and objects. Despite the proliferation of evaluation benchmarks and leaderboards for MLLMs, they predominantly overlook the critical capacity of MLLMs to comprehend world knowledge with structured abstractions that appear in visual form. To address this gap, we propose a novel evaluation paradigm and devise M3STR, an innovative benchmark grounded in the Multi-Modal Map for STRuctured understanding. This benchmark leverages multi-modal knowledge graphs to synthesize images encapsulating subgraph architectures enriched with multi-modal entities. M3STR necessitates that MLLMs not only recognize the multi-modal entities within the visual inputs but also decipher intricate relational topologies among them. We delineate the benchmark's statistical profiles and automated construction pipeline, accompanied by an extensive empirical analysis of 26 state-of-the-art MLLMs. Our findings reveal persistent deficiencies in processing abstractive visual information with structured knowledge, thereby charting a pivotal trajectory for advancing MLLMs' holistic reasoning capacities. Our code and data are released at https://github.com/zjukg/M3STR
中文: M3STR是一种新型基准测试,旨在评估多模态大语言模型理解视觉形式结构化知识的能力,尽管对26种先进模型进行了广泛测试,结果仍显示其推理能力存在显著不足。
English: M3STR is a novel benchmark designed to evaluate multi-modal large language models' ability to comprehend structured knowledge in visual form, revealing significant gaps in their reasoning capacities despite extensive testing of 26 advanced models.

Authors:Jisoo Mok, Ik-hwan Kim, Sangkwon Park, Sungroh Yoon
Title: Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis
Abstract:
Personalized AI assistants, a hallmark of the human-like capabilities of Large Language Models (LLMs), are a challenging application that intertwines multiple problems in LLM research. Despite the growing interest in the development of personalized assistants, the lack of an open-source conversational dataset tailored for personalization remains a significant obstacle for researchers in the field. To address this research gap, we introduce HiCUPID, a new benchmark to probe and unleash the potential of LLMs to deliver personalized responses. Alongside a conversational dataset, HiCUPID provides a Llama-3.2-based automated evaluation model whose assessment closely mirrors human preferences. We release our dataset, evaluation model, and code at https://github.com/12kimih/HiCUPID.
中文: 个性化AI助手是大型语言模型研究中的关键挑战,为解决缺乏开源对话数据的问题,HiCUPID基准被推出,它包含一个数据集和一个基于Llama-3.2的自动评估模型,其评估结果与人类偏好高度一致。
English: Personalized AI assistants represent a key challenge in LLM research, and to overcome the lack of open-source conversational data for personalization, the HiCUPID benchmark is introduced, which includes a dataset and an automated evaluation model based on Llama-3.2 that aligns with human preferences.

Authors:Yimin Du
Title: Memory-Efficient FastText: A Comprehensive Approach Using Double-Array Trie Structures and Mark-Compact Memory Management
Abstract:
FastText has established itself as a fundamental algorithm for learning word representations, demonstrating exceptional capability in handling out-of-vocabulary words through character-level n-gram embeddings. However, its hash-based bucketing mechanism introduces critical limitations for large-scale industrial deployment: hash collisions cause semantic drift, and memory requirements become prohibitively expensive when dealing with real-world vocabularies containing millions of terms. This paper presents a comprehensive memory optimization framework that fundamentally reimagines FastText's memory management through the integration of double-array trie (DA-trie) structures and mark-compact garbage collection principles. Our approach leverages the linguistic insight that n-grams sharing common prefixes or suffixes exhibit highly correlated embeddings due to co-occurrence patterns in natural language. By systematically identifying and merging semantically similar embeddings based on structural relationships, we achieve compression ratios of 4:1 to 10:1 while maintaining near-perfect embedding quality. The algorithm consists of four sophisticated phases: prefix trie construction with embedding mapping, prefix-based similarity compression, suffix-based similarity compression, and mark-compact memory reorganization. Comprehensive experiments on a 30-million Chinese vocabulary dataset demonstrate memory reduction from over 100GB to approximately 30GB with negligible performance degradation. Our industrial deployment results show significant cost reduction, faster loading times, and improved model reliability through the elimination of hash collision artifacts. Code and experimental implementations are available at: https://github.com/initial-d/me_fasttext
中文摘要:本文提出了一种针对FastText的内存优化框架,通过结合双数组字典树和标记-压缩原理来识别语义相似的n-gram进行嵌入压缩,在保持性能的同时实现了4:1至10:1的压缩比。
English Summary: This paper introduces a memory optimization framework for FastText that combines double-array tries and mark-compact principles to compress embeddings by identifying semantically similar n-grams, achieving 4:1 to 10:1 compression ratios while preserving performance.
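The compression idea, merging embeddings of structurally related n-grams when they are nearly identical, can be sketched without the trie machinery. The grouping rule, threshold, and toy vectors below are assumptions; the real system uses a double-array trie rather than this dictionary grouping:

```python
# Sketch of prefix-based similarity compression: n-grams sharing a prefix
# are aliased to one stored vector when their embeddings are close enough.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
base = rng.normal(size=16)
embeddings = {
    "cat":  base + rng.normal(scale=0.01, size=16),   # near-duplicate pair
    "cats": base + rng.normal(scale=0.01, size=16),
    "dog":  rng.normal(size=16),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

groups = defaultdict(list)
for ngram in embeddings:
    groups[ngram[:3]].append(ngram)              # group by shared prefix

compressed = {}
for prefix, ngrams in groups.items():
    rep = ngrams[0]
    compressed[rep] = embeddings[rep]
    for other in ngrams[1:]:
        if cosine(embeddings[rep], embeddings[other]) > 0.99:
            compressed[other] = compressed[rep]  # alias: no extra storage
        else:
            compressed[other] = embeddings[other]

print(len({id(v) for v in compressed.values()}), "stored vectors for",
      len(compressed), "n-grams")
```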

Authors:Shufeng Kong, Xingru Yang, Yuanyuan Wei, Zijie Wang, Hao Tang, Jiuqi Qin, Shuting Lan, Yingheng Wang, Junwen Bai, Zhuangbin Chen, Zibin Zheng, Caihua Liu, Hao Liang
Title: MTCMB: A Multi-Task Benchmark Framework for Evaluating LLMs on Knowledge, Reasoning, and Safety in Traditional Chinese Medicine
Abstract:
Traditional Chinese Medicine (TCM) is a holistic medical system with millennia of accumulated clinical experience, playing a vital role in global healthcare, particularly across East Asia. However, the implicit reasoning, diverse textual forms, and lack of standardization in TCM pose major challenges for computational modeling and evaluation. Large Language Models (LLMs) have demonstrated remarkable potential in processing natural language across diverse domains, including general medicine. Yet, their systematic evaluation in the TCM domain remains underdeveloped. Existing benchmarks either focus narrowly on factual question answering or lack domain-specific tasks and clinical realism. To fill this gap, we introduce MTCMB, a Multi-Task Benchmark for Evaluating LLMs on TCM Knowledge, Reasoning, and Safety. Developed in collaboration with certified TCM experts, MTCMB comprises 12 sub-datasets spanning five major categories: knowledge QA, language understanding, diagnostic reasoning, prescription generation, and safety evaluation. The benchmark integrates real-world case records, national licensing exams, and classical texts, providing an authentic and comprehensive testbed for TCM-capable models. Preliminary results indicate that current LLMs perform well on foundational knowledge but fall short in clinical reasoning, prescription planning, and safety compliance. These findings highlight the urgent need for domain-aligned benchmarks like MTCMB to guide the development of more competent and trustworthy medical AI systems. All datasets, code, and evaluation tools are publicly available at: https://github.com/Wayyuanyuan/MTCMB.
中文摘要:本文提出了MTCMB这一综合性多任务基准,通过与中医专家合作开发,系统评估大语言模型在中医知识、临床推理及安全合规方面的能力,发现现有模型虽掌握基础知识,但在实际临床应用方面存在不足。
English Summary: This paper introduces MTCMB, a comprehensive multi-task benchmark developed with TCM experts to systematically evaluate large language models' capabilities in Traditional Chinese Medicine knowledge, clinical reasoning, and safety compliance, revealing current models' strengths in basic knowledge but deficiencies in practical clinical applications.

Authors:SungHo Kim, Nayeon Kim, Taehee Jeon, SangKeun Lee
Title: Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean
Abstract:
We introduce the $\underline{Ko}rean \underline{G}rammar \underline{E}valuation Bench\underline{M}ark (KoGEM)$, designed to assess the linguistic competence of LLMs and humans in Korean. KoGEM consists of 1.5k multiple-choice QA pairs covering five main categories and 16 subcategories. The zero-shot evaluation of 27 LLMs of various sizes and types reveals that while LLMs perform remarkably well on straightforward tasks requiring primarily definitional knowledge, they struggle with tasks that demand the integration of real-world experiential knowledge, such as phonological rules and pronunciation. Furthermore, our in-depth analysis suggests that incorporating such experiential knowledge could enhance the linguistic competence of LLMs. With KoGEM, we not only highlight the limitations of current LLMs in linguistic competence but also uncover hidden facets of LLMs in linguistic competence, paving the way for enhancing comprehensive language understanding. Our code and dataset are available at: https://github.com/SungHo3268/KoGEM.
中文: KoGEM韩语语法评估基准测试了27个大语言模型,发现其在定义性知识任务表现出色,但在需要结合现实经验知识时存在不足,表明融入经验知识可提升语言能力。
English: KoGEM is a Korean grammar benchmark evaluating 27 LLMs, revealing their proficiency in definitional tasks but struggles with real-world knowledge integration, suggesting experiential learning could enhance their linguistic competence.

Authors:Antonia Karamolegkou, Oliver Eberle, Phillip Rust, Carina Kauf, Anders Søgaard
Title: Trick or Neat: Adversarial Ambiguity and Language Model Evaluation
Abstract:
Detecting ambiguity is important for language understanding, including uncertainty estimation, humour detection, and processing garden path sentences. We assess language models' sensitivity to ambiguity by introducing an adversarial ambiguity dataset that includes syntactic, lexical, and phonological ambiguities along with adversarial variations (e.g., word-order changes, synonym replacements, and random-based alterations). Our findings show that direct prompting fails to robustly identify ambiguity, while linear probes trained on model representations can decode ambiguity with high accuracy, sometimes exceeding 90\%. Our results offer insights into the prompting paradigm and how language models encode ambiguity at different layers. We release both our code and data: https://github.com/coastalcph/lm_ambiguity.
Chinese: 本研究通过对抗性数据集评估语言模型检测歧义的能力,发现基于模型表征训练的线性探针显著优于直接提示法,在解码歧义时准确率超过90%。
English: This study evaluates language models' ability to detect ambiguity through an adversarial dataset, finding that linear probes trained on model representations significantly outperform direct prompting, achieving over 90% accuracy in decoding ambiguity.

Authors:Mark Muchane, Sean Richardson, Kiho Park, Victor Veitch
Title: Incorporating Hierarchical Semantics in Sparse Autoencoder Architectures
Abstract:
Sparse dictionary learning (and, in particular, sparse autoencoders) attempts to learn a set of human-understandable concepts that can explain variation on an abstract space. A basic limitation of this approach is that it neither exploits nor represents the semantic relationships between the learned concepts. In this paper, we introduce a modified SAE architecture that explicitly models a semantic hierarchy of concepts. Application of this architecture to the internal representations of large language models shows both that semantic hierarchy can be learned, and that doing so improves both reconstruction and interpretability. Additionally, the architecture leads to significant improvements in computational efficiency.
Chinese: 本文提出了一种改进的稀疏自编码器架构,能够学习概念间的语义层次结构,从而提升大型语言模型的重构能力、可解释性及计算效率。
English: This paper introduces a modified sparse autoencoder architecture that learns a semantic hierarchy of concepts, improving reconstruction, interpretability, and computational efficiency in large language models.
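For context, this is the baseline sparse autoencoder the architecture modifies: a ReLU encoder, a linear decoder, and an L1 sparsity penalty. The hierarchical structure among latent concepts that the paper introduces is exactly what this baseline sketch omits:

```python
# Minimal baseline sparse autoencoder over LLM activations.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, d_dict=4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        z = torch.relu(self.encoder(x))   # sparse concept activations
        return self.decoder(z), z

sae = SparseAutoencoder()
x = torch.randn(8, 512)                   # stand-in LLM activations
x_hat, z = sae(x)
loss = ((x - x_hat) ** 2).mean() + 1e-3 * z.abs().mean()  # recon + L1
print(float(loss))
```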

Authors:Xintong Sun, Chi Wei, Minghao Tian, Shiwen Ni
Title: Earley-Driven Dynamic Pruning for Efficient Structured Decoding
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.
中文摘要:ZapFormat基于Earley算法提出动态剪枝策略,有效降低约束解码的计算负担,使Formatron引擎在保持高合规性的同时,将推理速度提升高达两倍,并适用于多种大语言模型架构。
English Summary: ZapFormat introduces a dynamic pruning strategy based on the Earley algorithm to reduce computational overhead in constrained decoding, enabling the Formatron engine to maintain high compliance while doubling inference speed across various LLM architectures.
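Whatever the parsing machinery, the per-step operation of a grammar-constrained decoder reduces to masking forbidden token logits before sampling. Here the allowed set is hardcoded for illustration, whereas an engine like Formatron derives it from (pruned) Earley parser states at each step:

```python
# Constrained decoding in one step: mask forbidden tokens to -inf, renormalize.
import numpy as np

vocab_size = 10
logits = np.random.default_rng(0).normal(size=vocab_size)
allowed = {2, 5, 7}                       # token ids the grammar permits here

mask = np.full(vocab_size, -np.inf)
mask[list(allowed)] = 0.0
masked = logits + mask                    # forbidden tokens get -inf

probs = np.exp(masked - masked.max())
probs /= probs.sum()
print(probs.round(3))                     # nonzero mass only on {2, 5, 7}
```

The expensive part is computing `allowed` over the whole vocabulary at every step; pruning dead Earley states and caching them is precisely what the paper targets.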

Authors:Metehan Oguz, Yavuz Bakman, Duygu Nur Yaldiz
Title: Un-considering Contextual Information: Assessing LLMs' Understanding of Indexical Elements
Abstract:
Large Language Models (LLMs) have demonstrated impressive performance in tasks related to coreference resolution. However, previous studies mostly assessed LLM performance on coreference resolution with nouns and third person pronouns. This study evaluates LLM performance on coreference resolution with indexicals like I, you, here, and tomorrow, which come with unique challenges due to their linguistic properties. We present the first study examining how LLMs interpret indexicals in English, releasing the English Indexical Dataset with 1600 multiple-choice questions. We evaluate pioneering LLMs, including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and DeepSeek V3. Our results reveal that LLMs exhibit an impressive performance with some indexicals (I), while struggling with others (you, here, tomorrow), and that syntactic cues (e.g. quotation) contribute to LLM performance with some indexicals, while they reduce performance with others. Code and data are available at: https://github.com/metehanoguzz/LLMs-Indexicals-English.
中文: 本研究评估了大语言模型对“我”、“你”等指示词进行指代消解的能力,发现模型在不同指示词上表现不一,且句法线索对性能的影响存在差异。
English: This study evaluates large language models' ability to resolve coreference for indexicals like "I" and "you," revealing varied performance across different terms and the mixed impact of syntactic cues.

Authors:Saibo Geng, Nathan Ranchin, Yunzhen Yao, Maxime Peyrard, Chris Wendler, Michael Gastpar, Robert West
Title: zip2zip: Inference-Time Adaptive Vocabularies for Language Models via Token Compression
Abstract:
Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized for general-purpose corpora. These tokenizers' fixed vocabularies often fail to adapt to domain- or language-specific inputs, leading to longer token sequences and higher computational costs. We introduce zip2zip, a framework that enables LLMs to dynamically adjust token vocabulary at inference time, allowing for fewer generated tokens and thus faster inference. zip2zip consists of three key components: (1) a tokenizer based on Lempel-Ziv-Welch (LZW) compression that incrementally compresses tokens into reusable "hypertokens" on the fly; (2) an embedding layer that computes embeddings for newly formed hypertokens at runtime; and (3) a causal language modeling variant that trains the model to operate on hypertokenized, compressed sequences. We show that an existing LLM can be zip2zip-fied in 10 GPU-hours via parameter-efficient finetuning. The resulting zip2zip LLMs effectively learn to use hypertokens at inference time, reducing input and output sequence length by 20-60\%, with significant improvements in inference latency.
中文摘要:zip2zip框架让大型语言模型能够在推理时动态调整词汇表,通过实时将令牌压缩为可复用的超令牌,使序列长度减少20-60%,并显著提升推理速度。
English Summary: The zip2zip framework enables large language models to dynamically adjust their token vocabulary during inference, reducing sequence length by 20-60% and significantly improving latency through real-time token compression into reusable hypertokens.
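The LZW pass at the core of the tokenizer can be sketched directly over token ids; minting embeddings for newly formed hypertokens at runtime, which the framework also handles, is omitted here:

```python
# Toy LZW compression over token ids: repeated sequences collapse into
# newly minted "hypertoken" ids above the base vocabulary.
def lzw_compress(token_ids, base_vocab_size):
    table = {(t,): t for t in range(base_vocab_size)}
    next_id, out, seq = base_vocab_size, [], ()
    for t in token_ids:
        if seq + (t,) in table:
            seq = seq + (t,)              # keep extending a known sequence
        else:
            out.append(table[seq])
            table[seq + (t,)] = next_id   # mint a new hypertoken
            next_id += 1
            seq = (t,)
    out.append(table[seq])
    return out

ids = [4, 7, 4, 7, 4, 7, 4, 7]
print(lzw_compress(ids, base_vocab_size=10))  # [4, 7, 10, 12, 7]
```

Eight input tokens become five output ids, with ids 10 and 12 standing for the learned sequences (4, 7) and (4, 7, 4); the same incremental table construction is what allows compression on the fly during decoding.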

Authors:Amir Hossein Kargaran, Yihong Liu, François Yvon, Hinrich Schütze
Title: How Programming Concepts and Neurons Are Shared in Code Language Models
Abstract:
Several studies have explored the mechanisms of large language models (LLMs) in coding tasks, but most have focused on programming languages (PLs) in a monolingual setting. In this paper, we investigate the relationship between multiple PLs and English in the concept space of LLMs. We perform a few-shot translation task on 21 PL pairs using two Llama-based models. By decoding the embeddings of intermediate layers during this task, we observe that the concept space is closer to English (including PL keywords) and assigns high probabilities to English tokens in the second half of the intermediate layers. We analyze neuron activations for 11 PLs and English, finding that while language-specific neurons are primarily concentrated in the bottom layers, those exclusive to each PL tend to appear in the top layers. For PLs that are highly aligned with multiple other PLs, identifying language-specific neurons is not feasible. These PLs also tend to have a larger keyword set than other PLs and are closer to the model's concept space regardless of the input/output PL in the translation task. Our findings provide insights into how LLMs internally represent PLs, revealing structural patterns in the model's concept space. Code is available at https://github.com/cisnlp/code-specific-neurons.
中文摘要:本研究探索大型语言模型中多种编程语言与英语在概念空间中的关系,发现模型的概念空间更接近英语,且不同编程语言在模型各层中的特定神经元分布存在差异,揭示了编程语言内部表征的结构模式。
English Summary: This study investigates how large language models represent multiple programming languages in relation to English, finding that their concept space aligns more closely with English and that language-specific neurons are distributed differently across model layers depending on programming language characteristics.
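Decoding intermediate-layer embeddings in this way is in the spirit of the logit lens: project each layer's hidden state through the final layer norm and the unembedding matrix, then read off the nearest tokens. Below is a sketch for GPT-2 via Hugging Face transformers (an assumption for illustration; the paper analyzes Llama-based models):

```python
# Logit-lens style decoding of intermediate hidden states.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tok("def add(a, b): return a +", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: embeddings plus one tensor per layer; apply the final
# layer norm and unembedding to the last position of each.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(layer, repr(tok.decode(logits.argmax())))
```

Watching where English-like tokens begin to dominate the top prediction, roughly the second half of the layers in the paper's finding, is the kind of observation this probe supports.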

Authors:Phan Anh Duong, Cat Luong, Divyesh Bommana, Tianyu Jiang
Title: CHEER-Ekman: Fine-grained Embodied Emotion Classification
Abstract:
Emotions manifest through physical experiences and bodily reactions, yet identifying such embodied emotions in text remains understudied. We present an embodied emotion classification dataset, CHEER-Ekman, extending the existing binary embodied emotion dataset with Ekman's six basic emotion categories. Using automatic best-worst scaling with large language models, we achieve performance superior to supervised approaches on our new dataset. Our investigation reveals that simplified prompting instructions and chain-of-thought reasoning significantly improve emotion recognition accuracy, enabling smaller models to achieve competitive performance with larger ones. Our dataset is publicly available at: https://github.com/menamerai/cheer-ekman.
中文摘要:研究人员开发了CHEER-Ekman数据集,基于埃克曼六种基本情绪扩展了具身情绪分类,通过优化提示指令使大型语言模型超越传统方法,并让较小模型实现了与之媲美的性能。
English Summary: Researchers developed CHEER-Ekman, an enhanced dataset for classifying embodied emotions using Ekman's six basic categories, where large language models with optimized prompts outperformed traditional methods and enabled smaller models to achieve competitive accuracy.
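Best-worst scaling itself reduces to simple counting: an item's score is how often it was chosen best minus how often it was chosen worst, normalized by how often it appeared. The judgments below are illustrative, not the paper's LLM outputs:

```python
# Counting form of best-worst scaling: score = (best - worst) / appearances.
from collections import Counter

trials = [                       # (chosen best, chosen worst) per tuple shown
    ("a", "c"), ("a", "d"), ("b", "a"), ("b", "c"),
]
best, worst = Counter(), Counter()
for b, w in trials:
    best[b] += 1
    worst[w] += 1

n_shown = len(trials)            # assume every item appeared in every trial
scores = {i: (best[i] - worst[i]) / n_shown for i in "abcd"}
print(scores)  # {'a': 0.25, 'b': 0.5, 'c': -0.5, 'd': -0.25}
```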

Authors:Yu Zheng, Yuan Yuan, Yong Li, Paolo Santi
Title: Probing Neural Topology of Large Language Models
Abstract:
Probing large language models (LLMs) has yielded valuable insights into their internal mechanisms by linking neural representations to interpretable semantics. However, how neurons functionally co-activate with each other to give rise to emergent capabilities remains largely unknown, hindering a deeper understanding and safer development of LLMs. In this work, we introduce graph probing, a method for uncovering the functional connectivity topology of LLM neurons and relating it to language generation performance. By analyzing internal neural graphs across diverse LLM families and scales, we discover a universal predictability of next-token prediction performance using only neural topology. This predictability is robust even when retaining just 1% of neuron connections or probing models after only 8 pretraining steps, highlighting the sparsity and early emergence of topological patterns. Further graph matching analysis suggests that, despite significant distinctions in architectures, parameters, and training data, different LLMs develop intricate and consistent neural topological structures that may form the foundation for their language generation abilities. Codes and data for the graph probing toolbox are released at https://github.com/DavyMorgan/llm-graph-probing.
中文摘要:本研究提出图探针方法,揭示大语言模型中神经拓扑结构比神经激活更能有效预测语言生成性能,为提升模型效率与安全性提供了新途径。
English Summary: This study introduces graph probing to reveal how neural topology in large language models predicts language generation performance more effectively than neural activations, offering potential applications in enhancing model efficiency and safety.
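
The generic recipe behind such probing (correlate neuron activations, sparsify to the strongest connections, summarize the resulting graph) can be sketched in a few lines. The random activations and sizes below are toy stand-ins, not the paper's setup or its actual probe.

import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for neuron activations over a text: (n_tokens, n_neurons).
acts = rng.standard_normal((512, 200))

# Functional connectivity: pairwise correlation between neurons.
corr = np.corrcoef(acts.T)
np.fill_diagonal(corr, 0.0)

# Sparsify: keep only the strongest 1% of connections, mirroring the
# abstract's connection-retention experiment.
thresh = np.quantile(np.abs(corr), 0.99)
adj = (np.abs(corr) >= thresh).astype(float)

# Simple topological descriptors a downstream regressor could use to
# predict next-token performance.
degree = adj.sum(axis=1)
print("mean degree:", degree.mean(),
      "edge density:", adj.sum() / (adj.size - adj.shape[0]))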

Authors:Alexander Sergeev, Valeriya Goloviznina, Mikhail Melnichenko, Evgeny Kotelnikov
Title: Talking to Data: Designing Smart Assistants for Humanities Databases
Abstract:
Access to humanities research databases is often hindered by the limitations of traditional interaction formats, particularly in the methods of searching and response generation. This study introduces an LLM-based smart assistant designed to facilitate natural language communication with digital humanities data. The assistant, developed in a chatbot format, leverages the RAG approach and integrates state-of-the-art technologies such as hybrid search, automatic query generation, text-to-SQL filtering, semantic database search, and hyperlink insertion. To evaluate the effectiveness of the system, experiments were conducted to assess the response quality of various language models. The testing was based on the Prozhito digital archive, which contains diary entries from predominantly Russian-speaking individuals who lived in the 20th century. The chatbot is tailored to support anthropology and history researchers, as well as non-specialist users with an interest in the field, without requiring prior technical training. By enabling researchers to query complex databases with natural language, this tool aims to enhance accessibility and efficiency in humanities research. The study highlights the potential of Large Language Models to transform the way researchers and the public interact with digital archives, making them more intuitive and inclusive. Additional materials are available in the GitHub repository: https://github.com/alekosus/talking-to-data-intersys2025.
中文摘要:本研究开发了一种基于大语言模型的智能助手,采用RAG技术实现数字人文档案的自然语言查询,使研究者和公众无需专业技术背景即可便捷访问。
English Summary: This study develops an LLM-based chatbot using RAG technology to enable natural language queries for digital humanities archives, enhancing accessibility for researchers and the public without technical expertise.
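
The abstract mentions hybrid search, which typically fuses a keyword ranking with a semantic ranking. One common fusion rule is reciprocal rank fusion (RRF); the paper does not specify its fusion method, so the sketch below is an assumption about how such a component could look, with invented document ids.

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: ranked lists of document ids, best first; RRF sums
    # 1/(k + rank) across lists, rewarding documents ranked high anywhere.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["d3", "d1", "d7", "d2"]    # keyword (BM25) ranking
dense_hits = ["d1", "d5", "d3", "d9"]   # semantic (embedding) ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))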

Authors:Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
Title: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
Abstract:
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it is less clear to what extent pre-training on a specific language improves the encoding of language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is readily detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit in linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
中文摘要:与英语或多语言预训练相比,专门针对荷兰语进行预训练的Wav2Vec2自监督模型能更好地编码荷兰语音位和词汇特征,这种语言特异性优势可通过探测方法检测,并与自动语音识别性能提升相关。
English Summary: Pre-training self-supervised Wav2Vec2 models specifically on Dutch enhances the encoding of Dutch phonetic and lexical features compared to English or multilingual pre-training, with this language-specific advantage detectable through probing methods and correlating with improved automatic speech recognition performance.
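
A trained classification probe of the kind mentioned here is usually just a linear classifier fit on frozen layer representations. The sketch below uses random arrays in place of real Wav2Vec2 hidden states and phone labels; only the probing recipe itself is the point, and the dimensions are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Toy stand-ins for frame-level hidden states from one Wav2Vec2 layer
# (n_frames, hidden_dim) and per-frame phone labels.
X = rng.standard_normal((2000, 768))
y = rng.integers(0, 40, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
# Repeating this per layer and per pre-training language gives the
# layer-wise comparison the abstract describes.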

Authors:Dren Fazlija, Arkadij Orlov, Sandipan Sikdar
Title: ACCESS DENIED INC: The First Benchmark Environment for Sensitivity Awareness
Abstract:
Large language models (LLMs) are increasingly becoming valuable to corporate data management due to their ability to process text from various document formats and facilitate user interactions through natural language queries. However, LLMs must consider the sensitivity of information when communicating with employees, especially given access restrictions. Simple filtering based on user clearance levels can pose both performance and privacy challenges. To address this, we propose the concept of sensitivity awareness (SA), which enables LLMs to adhere to predefined access rights rules. In addition, we developed a benchmarking environment called ACCESS DENIED INC to evaluate SA. Our experimental findings reveal significant variations in model behavior, particularly in managing unauthorized data requests while effectively addressing legitimate queries. This work establishes a foundation for benchmarking sensitivity-aware language models and provides insights to enhance privacy-centric AI systems in corporate environments.

Authors:Youngmin Kim, Jiwan Chung, Jisoo Kim, Sunghyun Lee, Sangkyu Lee, Junhyeok Kim, Cheoljong Yang, Youngjae Yu
Title: Speaking Beyond Language: A Large-Scale Multimodal Dataset for Learning Nonverbal Cues from Video-Grounded Dialogues
Abstract:
Nonverbal communication is integral to human interaction, with gestures, facial expressions, and body language conveying critical aspects of intent and emotion. However, existing large language models (LLMs) fail to effectively incorporate these nonverbal elements, limiting their capacity to create fully immersive conversational experiences. We introduce MARS, a multimodal language model designed to understand and generate nonverbal cues alongside text, bridging this gap in conversational AI. Our key innovation is VENUS, a large-scale dataset comprising annotated videos with time-aligned text, facial expressions, and body language. Leveraging VENUS, we train MARS with a next-token prediction objective, combining text with vector-quantized nonverbal representations to achieve multimodal understanding and generation within a unified framework. Based on various analyses of the VENUS dataset, we validate its substantial scale and effectiveness. Our quantitative and qualitative results demonstrate that MARS successfully generates text and nonverbal cues corresponding to conversational input.
中文: MARS是一种多模态语言模型,通过结合文本与表情、肢体语言等非语言线索,并利用VENUS数据集训练,实现了更沉浸式的对话AI体验。
English: MARS is a multimodal language model that integrates nonverbal cues like facial expressions and body language with text, using the VENUS dataset to enable more immersive conversational AI experiences.

Authors:Yongqi Li, Shen Zhou, Xiaohu Li, Xin Miao, Jintao Wen, Mayi Xu, Jianhao Chen, Birong Pan, Hankun Kang, Yuanyuan Zhu, Ming Zhong, Tieyun Qian
Title: Aligning VLM Assistants with Personalized Situated Cognition
Abstract:
Vision-language models (VLMs) aligned with general human objectives, such as being harmless and hallucination-free, have become valuable assistants of humans in managing visual tasks. However, people with diversified backgrounds have different cognition even in the same situation. Consequently, they may have personalized expectations for VLM assistants. This highlights the urgent need to align VLM assistants with personalized situated cognition for real-world assistance. To study this problem, we first simplify it by characterizing individuals based on the sociological concept of Role-Set. Then, we propose to evaluate the individuals' actions to examine whether the personalized alignment is achieved. Further, we construct a benchmark named PCogAlignBench, which includes 18k instances and 20 individuals with different Role-Sets. Finally, we present a framework called PCogAlign, which constructs a cognition-aware and action-based reward model for personalized alignment. Experimental results and human evaluations demonstrate the reliability of the PCogAlignBench and the effectiveness of our proposed PCogAlign. We will open-source the constructed benchmark and code at https://github.com/NLPGM/PCogAlign.
中文摘要:视觉语言模型需要个性化对齐以适应不同的人类认知,为此构建了PCogAlignBench基准和PCogAlign框架,以实现有效的个性化辅助。
English Summary: Vision-language models need personalized alignment to match diverse human cognition, prompting the creation of PCogAlignBench and PCogAlign framework for effective individualized assistance.

Authors:Jinfeng Zhou, Yuxuan Chen, Yihan Shi, Xuanming Zhang, Leqi Lei, Yi Feng, Zexuan Xiong, Miao Yan, Xunzhi Wang, Yaru Cao, Jianing Yin, Shuai Wang, Quanyu Dai, Zhenhua Dong, Hongning Wang, Minlie Huang
Title: SocialEval: Evaluating Social Intelligence of Large Language Models
Abstract:
LLMs exhibit promising Social Intelligence (SI) in modeling human behavior, raising the need to evaluate LLMs' SI and their discrepancy with humans. SI equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals. This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation, which existing work fails to address. To this end, we propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts. Each script is structured as a world tree that contains plot lines driven by interpersonal ability, providing a comprehensive view of how LLMs navigate social interactions. Experiments show that LLMs fall behind humans on both SI evaluations, exhibit prosociality, and prefer more positive social behaviors, even if they lead to goal failure. Analysis of LLMs' formed representation space and neuronal activations reveals that LLMs have developed ability-specific functional partitions akin to the human brain.
中文摘要:大语言模型展现出有前景的社会智能,但在目标达成和人际能力方面仍落后于人类,这通过整合结果导向与过程导向评估的SocialEval基准测试得以验证。
English Summary: LLMs demonstrate promising social intelligence but still lag behind humans in both goal achievement and interpersonal abilities, as evaluated by the SocialEval benchmark which integrates outcome- and process-oriented assessments.

Authors:Keyuan Cheng, Xudong Shen, Yihao Yang, Tengyue Wang, Yang Cao, Muhammad Asif Ali, Hanbin Wang, Lijie Hu, Di Wang
Title: CODEMENV: Benchmarking Large Language Models on Code Migration
Abstract:
Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In this work, we introduce CODEMENV: Code Migration Across Environment, a new benchmark specifically designed to assess LLMs' abilities in code migration scenarios. CODEMENV consists of 922 examples spanning 19 Python and Java packages, and covers three core tasks: (1) identifying functions incompatible with specific versions, (2) detecting changes in function definitions, and (3) adapting code to target environments. Experimental evaluation with seven LLMs on CODEMENV yields an average pass@1 rate of 26.50%, with GPT-4O achieving the highest score at 43.84%. Key findings include: (i) LLMs tend to be more proficient with newer function versions, which aids in migrating legacy code, and (ii) LLMs sometimes exhibit logical inconsistencies by identifying function changes irrelevant to the intended migration environment. The datasets are available at https://github.com/xdshen-ai/Benchmark-of-Code-Migration.
中文: 该摘要介绍了CODEMENV这一评估大语言模型在代码迁移任务中表现的新基准,结果显示平均通过率为26.50%,并揭示了模型对新版本函数的熟练度以及偶尔出现的逻辑不一致问题。
English: This abstract introduces CODEMENV, a new benchmark for evaluating large language models' performance in code migration tasks, revealing an average pass rate of 26.50% and highlighting both their proficiency with newer function versions and occasional logical inconsistencies.

Authors:Ryo Fujii, Hideo Saito, Ryo Hachiuma
Title: Towards Predicting Any Human Trajectory In Context
Abstract:
Predicting accurate future trajectories of pedestrians is essential for autonomous systems but remains a challenging task due to the need for adaptability in different environments and domains. A common approach involves collecting scenario-specific data and performing fine-tuning via backpropagation. However, this process is often impractical on edge devices due to constrained computational resources. To address this challenge, we introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian trajectory prediction that enables rapid adaptation without fine-tuning on the scenario-specific data. We propose a spatio-temporal similarity-based example selection (STES) method that selects relevant examples from previously observed trajectories within the same scene by identifying similar motion patterns at corresponding locations. To further refine this selection, we introduce prediction-guided example selection (PG-ES), which selects examples based on both the past trajectory and the predicted future trajectory, rather than relying solely on the past trajectory. This approach allows the model to account for long-term dynamics when selecting examples. Finally, instead of relying on small real-world datasets with limited scenario diversity, we train our model on a large-scale synthetic dataset to enhance its prediction ability by leveraging in-context examples. Extensive experiments demonstrate that TrajICL achieves remarkable adaptation across both in-domain and cross-domain scenarios, outperforming even fine-tuned approaches across multiple public benchmarks. The code will be released at https://fujiry0.github.io/TrajICL-project-page.

Authors:Nidhi Kowtal, Raviraj Joshi
Title: L3Cube-MahaEmotions: A Marathi Emotion Recognition Dataset with Synthetic Annotations using CoTR prompting and Large Language Models
Abstract:
Emotion recognition in low-resource languages like Marathi remains challenging due to limited annotated data. We present L3Cube-MahaEmotions, a high-quality Marathi emotion recognition dataset with 11 fine-grained emotion labels. The training data is synthetically annotated using large language models (LLMs), while the validation and test sets are manually labeled to serve as a reliable gold-standard benchmark. Building on the MahaSent dataset, we apply the Chain-of-Translation (CoTR) prompting technique, where Marathi sentences are translated into English and labeled for emotion within a single prompt. GPT-4 and Llama3-405B were evaluated, with GPT-4 selected for training data annotation due to superior label quality. We evaluate model performance using standard metrics and explore label aggregation strategies (e.g., Union, Intersection). While GPT-4 predictions outperform fine-tuned BERT models, BERT-based models trained on synthetic labels fail to surpass GPT-4. This highlights both the importance of high-quality human-labeled data and the inherent complexity of emotion recognition. An important finding of this work is that generic LLMs like GPT-4 and Llama3-405B generalize better than fine-tuned BERT for complex low-resource emotion recognition tasks. The dataset and model are shared publicly at https://github.com/l3cube-pune/MarathiNLP
中文摘要:本研究推出L3Cube-MahaEmotions马拉地语情感数据集,通过合成标注与人工验证相结合的方式,证明通用大语言模型(如GPT-4)在低资源情感识别任务中优于微调的BERT模型。
English Summary: This study introduces L3Cube-MahaEmotions, a Marathi emotion dataset combining synthetic LLM annotations with human-verified labels, demonstrating that general-purpose LLMs like GPT-4 outperform fine-tuned BERT models in low-resource emotion recognition tasks.
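
Chain-of-Translation prompting, as described here, combines translation and labeling in one prompt. The template below is an illustrative guess at such a prompt, not the authors' wording, and the label list shown is a subset invented for the example; the dataset itself defines 11 fine-grained labels.

def cotr_prompt(marathi_sentence, labels):
    # Chain-of-Translation: request an English translation first, then
    # an emotion label, all within one prompt.
    return (
        "Step 1: Translate the following Marathi sentence into English.\n"
        "Step 2: Based on the translation, assign exactly one emotion "
        f"label from: {', '.join(labels)}.\n\n"
        f"Marathi sentence: {marathi_sentence}\n"
        "Translation:"
    )

labels = ["anger", "fear", "joy", "sadness", "surprise"]  # illustrative
# subset; the dataset defines 11 fine-grained labels.
print(cotr_prompt("<Marathi text>", labels))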

Authors:Keyuan Cheng, Zijian Kan, Zhixian He, Zhuoran Zhang, Muhammad Asif Ali, Ke Xu, Lijie Hu, Di Wang
Title: COMPKE: Complex Question Answering under Knowledge Editing
Abstract:
Knowledge Editing, which efficiently modifies the knowledge in large language models, has attracted great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning, involving one-to-many relationships or multi-step logical intersections. To fill this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4O-MINI, but this drops sharply to 3.83 on QWEN2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at https://github.com/kzjkzj666/CompKE.
中文: 本文提出了COMPKE新基准,通过复杂的现实问题评估大语言模型的知识编辑能力,发现不同模型和方法的效果存在显著差异。
English: This paper introduces COMPKE, a new benchmark designed to evaluate knowledge editing in large language models through complex, real-life questions, revealing significant performance variations across different models and methods.

Authors:Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, Jianwei Yin
Title: Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
Abstract:
Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the "truth direction", which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at https://github.com/colored-dye/truthfulness_probe_generalization
Chinese Summary: 本研究探讨了大语言模型中“真实性方向”的一致性和泛化能力,发现其因模型而异,并能有效应用于多种任务以提升输出真实性,从而增强用户信任。
English Summary: This study investigates the consistency and generalizability of "truth directions" in large language models, finding they vary across models and can be effectively applied to enhance truthfulness in various tasks, thereby improving user trust.

Authors:Rong Wu, Pinlong Cai, Jianbiao Mei, Licheng Wen, Tao Hu, Xuemeng Yang, Daocheng Fu, Botian Shi
Title: KG-TRACES: Enhancing Large Language Models with Knowledge Graph-constrained Trajectory Reasoning and Attribution Supervision
Abstract:
Large language models (LLMs) have made remarkable strides in various natural language processing tasks, but their performance on complex reasoning problems remains hindered by a lack of explainability and trustworthiness. This issue, often manifesting as hallucinations or unattributable reasoning processes, limits their applicability in complex reasoning scenarios. To address this, we propose Knowledge Graph-constrained Trajectory Reasoning Attribution and Chain Explanation Supervision (KG-TRACES), a novel framework that enhances the reasoning ability of LLMs through explicit supervision over reasoning paths and processes. KG-TRACES jointly supervises the model to: (1) predict symbolic relation paths, (2) predict full triple-level reasoning paths, and (3) generate attribution-aware reasoning processes grounded in the reasoning paths. At the inference phase, the model adapts to both KG-available and KG-unavailable scenarios, retrieving reasoning paths from a KG when possible or predicting plausible reasoning paths with only intrinsic knowledge when not. This design enables the model to reason in an explainable and source-attributable manner. Through extensive experiments on complex reasoning tasks, we demonstrate that KG-TRACES significantly outperforms existing SOTA: it improves Hits@1 by 1.6% and F1 by 4.7% on WebQSP, and achieves improvements of 4.8% in Hits@1 and 2.1% in F1 on CWQ. Moreover, we show its transferability to specialized domains such as medicine. By visualizing the intermediate steps of reasoning processes, we further show that the explicit supervision introduced by KG-TRACES leads to more stable and goal-directed reasoning processes, aligning closely with correct answers. Code is available at https://github.com/Edaizi/KG-TRACES.
Chinese: 提出的KG-TRACES框架通过监督推理路径和过程来增强大语言模型的复杂推理能力,在知识图谱可用与不可用场景下均实现了更优的性能与可解释性。
English: The proposed KG-TRACES framework enhances LLMs' complex reasoning by supervising reasoning paths and processes, achieving superior performance and explainability in both KG-available and KG-unavailable scenarios.

Authors:Md Tahmid Rahman Laskar, Israt Jahan, Elham Dolatabadi, Chun Peng, Enamul Hoque, Jimmy Huang
Title: Improving Automatic Evaluation of Large Language Models (LLMs) in Biomedical Relation Extraction via LLMs-as-the-Judge
Abstract:
Large Language Models (LLMs) have demonstrated impressive performance in biomedical relation extraction, even in zero-shot scenarios. However, evaluating LLMs in this task remains challenging due to their ability to generate human-like text, often producing synonyms or abbreviations of gold-standard answers, making traditional automatic evaluation metrics unreliable. On the other hand, while human evaluation is more reliable, it is costly and time-consuming, making it impractical for real-world applications. This paper investigates the use of LLMs-as-the-Judge as an alternative evaluation method for biomedical relation extraction. We benchmark 8 LLMs as judges to evaluate the responses generated by 5 other LLMs across 3 biomedical relation extraction datasets. Unlike other text-generation tasks, we observe that LLM-based judges perform quite poorly (usually below 50% accuracy) in the biomedical relation extraction task. Our findings reveal that it happens mainly because relations extracted by LLMs do not adhere to any standard format. To address this, we propose structured output formatting for LLM-generated responses that helps LLM-Judges to improve their performance by about 15% (on average). We also introduce a domain adaptation technique to further enhance LLM-Judge performance by effectively transferring knowledge between datasets. We release both our human-annotated and LLM-annotated judgment data (36k samples in total) for public use here: https://github.com/tahmedge/llm_judge_biomedical_re.
中文摘要:本研究探讨使用大语言模型作为生物医学关系抽取的评估者,发现其低准确率源于输出格式不一致,并提出结构化格式和领域自适应方法,将性能平均提升约15%。
English Summary: This study explores using LLMs as judges for biomedical relation extraction, finding their low accuracy stems from inconsistent output formats and proposing structured formatting and domain adaptation to boost performance by about 15%.

Authors:Boheng Sheng, Jiacheng Yao, Meicong Zhang, Guoxiu He
Title: Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models
Abstract:
Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks separating semantically relevant content, leading to ambiguity and compromising accurate understanding. To overcome this limitation, we propose a straightforward approach for dynamically separating and selecting chunks of long context, facilitating a more streamlined input for LLMs. In particular, we compute semantic similarities between adjacent sentences, using lower similarities to adaptively divide long contexts into variable-length chunks. We further train a question-aware classifier to select sensitive chunks that are critical for answering specific questions. Experimental results on both single-hop and multi-hop question-answering benchmarks show that the proposed approach consistently outperforms strong baselines. Notably, it maintains robustness across a wide range of input lengths, handling sequences of up to 256k tokens. Our datasets and code are available at the following link: https://github.com/ECNU-Text-Computing/DCS
中文摘要:该方法通过语义相似度动态分割长文本为变长片段,并利用问题感知分类器筛选关键内容,显著提升大语言模型在长文本问答任务中的表现。
English Summary: The proposed method dynamically segments long texts into variable-length chunks based on semantic similarity and uses a question-aware classifier to select key segments, significantly improving LLMs' performance on long-context question-answering tasks.
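
The chunking rule described here (start a new chunk wherever adjacent-sentence similarity drops) is easy to sketch. The encoder checkpoint and similarity threshold below are placeholder choices, not values from the paper, and the question-aware chunk selection stage is omitted.

from sentence_transformers import SentenceTransformer

def dynamic_chunks(sentences, threshold=0.55):
    # Embed sentences and split wherever the cosine similarity between
    # adjacent sentences falls below the threshold, yielding
    # variable-length, semantically coherent chunks.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder
    emb = model.encode(sentences, normalize_embeddings=True)
    sims = (emb[:-1] * emb[1:]).sum(axis=1)
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

doc = ["The company was founded in 1998.",
       "Its first product was a search engine.",
       "April tends to be rainy in the region.",
       "Spring weather there is otherwise mild."]
print(dynamic_chunks(doc))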

Authors:Zihang Liu, Tianyu Pang, Oleg Balabanov, Chaoqun Yang, Tianjin Huang, Lu Yin, Yaoqing Yang, Shiwei Liu
Title: LIFT the Veil for the Truth: Principal Weights Emerge after Rank Reduction for Reasoning-Focused Supervised Fine-Tuning
Abstract:
Recent studies have shown that supervised fine-tuning of LLMs on a small number of high-quality datasets can yield strong reasoning capabilities. However, full fine-tuning (Full FT), while powerful, is computationally expensive and susceptible to overfitting and catastrophic forgetting, particularly when data is limited. Sparse fine-tuning, which previously achieved notable success by updating only a small subset of model parameters, offers a promising trade-off between efficiency and effectiveness. Yet, it has lagged behind in the LLM era due to the difficulty of identifying parameters truly critical for reasoning. In this work, we state that weights with the largest magnitude after low-rank approximation are critical weights for fine-tuning, which we call Principal Weights. Surprisingly, while magnitude-based sparse fine-tuning performs poorly as a baseline on LLM fine-tuning, it becomes highly effective after rank reduction. These insights motivate our method: Low-rank Informed Sparse Fine-Tuning (LIFT). LIFT only updates the top 5% Principal Weights throughout training and consistently achieves better performance on reasoning tasks than Full FT, while maintaining memory efficiency on par with popular parameter-efficient fine-tuning methods. In addition to strong performance on target domains such as arithmetic reasoning, LIFT also retains up to 20% more source-domain knowledge, compared to Full FT and LoRA. Our code is available at: https://github.com/zihanghliu/LIFT.
中文: 近期研究提出低秩信息稀疏微调方法LIFT,通过低秩近似识别并仅更新前5%的主权重,在推理任务中表现优于全参数微调,同时保持高内存效率并显著保留源领域知识。
English: Recent research introduces Low-rank Informed Sparse Fine-Tuning (LIFT), a method that updates only the top 5% of principal weights identified through low-rank approximation, achieving superior reasoning performance and memory efficiency compared to full fine-tuning while better preserving source-domain knowledge.
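
The core selection step, taking a low-rank approximation of a weight matrix and keeping its largest-magnitude 5% of entries, can be sketched directly. This is a minimal illustration of the selection rule; the rank and the masked-gradient update in the final comment are assumptions, not the paper's exact training loop.

import torch

def principal_weight_mask(W, rank=16, keep=0.05):
    # Truncated SVD gives a low-rank approximation of W; the mask keeps
    # the top `keep` fraction of its entries by magnitude.
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    W_low = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank, :]
    k = max(1, int(keep * W_low.numel()))
    cutoff = W_low.abs().flatten().kthvalue(W_low.numel() - k + 1).values
    return W_low.abs() >= cutoff

W = torch.randn(256, 256)
mask = principal_weight_mask(W)
print("fraction kept:", mask.float().mean().item())
# A sparse fine-tuning step would then zero gradients off the mask,
# e.g. W.grad.mul_(mask) before the optimizer step (assumed usage).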

Authors:Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed
Title: DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments
Abstract:
Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.
Chinese: DefenderBench 是一个评估大语言模型在网络安全任务中表现的开源工具包,其中 Claude-3.7-sonnet 在基准测试中以 81.65 分获得最高分。
English: DefenderBench is an open-source toolkit for evaluating LLM agents in cybersecurity tasks, with Claude-3.7-sonnet achieving the highest score of 81.65 in benchmark tests.

Authors:Li Zhang, Morgan Gray, Jaromir Savelka, Kevin D. Ashley
Title: Measuring Faithfulness and Abstention: An Automated Pipeline for Evaluating LLM-Generated 3-ply Case-Based Legal Arguments
Abstract:
Large Language Models (LLMs) demonstrate potential in complex legal tasks like argument generation, yet their reliability remains a concern. Building upon pilot work assessing LLM generation of 3-ply legal arguments using human evaluation, this paper introduces an automated pipeline to evaluate LLM performance on this task, specifically focusing on faithfulness (absence of hallucination), factor utilization, and appropriate abstention. We define hallucination as the generation of factors not present in the input case materials and abstention as the model's ability to refrain from generating arguments when instructed and no factual basis exists. Our automated method employs an external LLM to extract factors from generated arguments and compares them against the ground-truth factors provided in the input case triples (current case and two precedent cases). We evaluated eight distinct LLMs on three tests of increasing difficulty: 1) generating a standard 3-ply argument, 2) generating an argument with swapped precedent roles, and 3) recognizing the impossibility of argument generation due to lack of shared factors and abstaining. Our findings indicate that while current LLMs achieve high accuracy (over 90%) in avoiding hallucination on viable argument generation tests (Tests 1 & 2), they often fail to utilize the full set of relevant factors present in the cases. Critically, on the abstention test (Test 3), most models failed to follow instructions to stop, instead generating spurious arguments despite the lack of common factors. This automated pipeline provides a scalable method for assessing these crucial LLM behaviors, highlighting the need for improvements in factor utilization and robust abstention capabilities before reliable deployment in legal settings. Link: https://lizhang-aiandlaw.github.io/An-Automated-Pipeline-for-Evaluating-LLM-Generated-3-ply-Case-Based-Legal-Arguments/

Authors:Yufa Zhou, Shaobo Wang, Xingyu Dong, Xiangqi Jin, Yifang Chen, Yue Min, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
Title: Reasoning Like an Economist: Post-Training on Economic Problems Induces Strategic Generalization in LLMs
Abstract:
Directly training Large Language Models (LLMs) for Multi-Agent Systems (MAS) remains challenging due to intricate reward modeling, dynamic agent interactions, and demanding generalization requirements. This paper explores whether post-training techniques, specifically Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), can effectively generalize to multi-agent scenarios. We use economic reasoning as a testbed, leveraging its strong foundations in mathematics and game theory, its demand for structured analytical reasoning, and its relevance to real-world applications such as market design, resource allocation, and policy analysis. We introduce Recon (Reasoning like an ECONomist), a 7B-parameter open-source LLM post-trained on a hand-curated dataset of 2,100 high-quality economic reasoning problems. Comprehensive evaluation on economic reasoning benchmarks and multi-agent games reveals clear improvements in structured reasoning and economic rationality. These results underscore the promise of domain-aligned post-training for enhancing reasoning and agent alignment, shedding light on the roles of SFT and RL in shaping model behavior. Code is available at https://github.com/MasterZhou1/Recon.
中文: 本文介绍了Recon,一个通过监督微调和可验证奖励强化学习在经济学推理问题上进行后训练的70亿参数大语言模型,在多智能体场景中展现出增强的结构化推理和经济理性能力。
English: This paper introduces Recon, a 7B-parameter LLM post-trained using SFT and RLVR techniques on economic reasoning problems, demonstrating improved structured reasoning and economic rationality in multi-agent scenarios.

Authors:Ming Wang, Peidong Wang, Lin Wu, Xiaocui Yang, Daling Wang, Shi Feng, Yuxin Chen, Bixuan Wang, Yifei Zhang
Title: AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation
Abstract:
Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers' mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose AnnaAgent, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator's configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on https://github.com/sci-m-wang/AnnaAgent.
中文:研究人员开发了AnnaAgent,这是一种具备动态情绪控制和多会话记忆功能的先进对话代理,通过整合真实心理咨询数据克服了模拟心理健康求助者的局限性,并在评估中展现出更优越的性能。
English: Researchers have developed AnnaAgent, an advanced conversational agent with dynamic emotional control and multi-session memory, to overcome limitations in simulating realistic mental health seekers by incorporating real counseling data and achieving superior performance in evaluations.

Authors:Hyangsuk Min, Yuho Lee, Minjeong Ban, Jiaqi Deng, Nicole Hee-Yeon Kim, Taewon Yun, Hang Su, Jason Cai, Hwanjun Song
Title: Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages
Abstract:
Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi-agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their assessment of self-generated summaries. Our benchmark dataset is publicly available at https://github.com/DISL-Lab/MSumBench.
中文:该摘要介绍了MSumBench,一个针对英文和中文文本摘要的多维、多领域评估基准,它通过引入专业评估标准和多智能体辩论系统解决了领域特定标准缺失和人工标注的难题,揭示了不同模型的表现模式以及大型语言模型作为评估者时的系统性偏见。
English: This abstract introduces MSumBench, a multi-dimensional and multi-domain evaluation benchmark for text summarization in English and Chinese, which addresses gaps in domain-specific criteria and human annotation challenges by incorporating specialized assessments and a multi-agent debate system, revealing distinct model performance patterns and biases in large language models as evaluators.

Authors:Changyue Wang, Weihang Su, Qingyao Ai, Yujia Zhou, Yiqun Liu
Title: Decoupling Reasoning and Knowledge Injection for In-Context Knowledge Editing
Abstract:
Knowledge editing aims to efficiently update Large Language Models (LLMs) by modifying specific knowledge without retraining the entire model. Among knowledge editing approaches, in-context editing (ICE) offers a lightweight solution by injecting new knowledge directly into the input context, leaving model parameters unchanged. However, existing ICE approaches do not explicitly separate the newly injected knowledge from the model's original reasoning process. This entanglement often results in conflicts between external updates and internal parametric knowledge, undermining the consistency and accuracy of the reasoning path. In this work, we conduct preliminary experiments to examine how parametric knowledge influences reasoning path planning. We find that the model's reasoning is tightly coupled with its internal knowledge, and that naively injecting new information without adapting the reasoning path often leads to performance degradation, particularly in multi-hop tasks. To this end, we propose DecKER, a novel ICE framework that decouples reasoning from knowledge editing by generating a masked reasoning path and then resolving knowledge edits via hybrid retrieval and model-based validation. Experiments on multi-hop QA benchmarks show that DecKER significantly outperforms existing ICE methods by mitigating knowledge conflicts and preserving reasoning consistency. Our code is available at: https://github.com/bebr2/DecKER.
中文: 知识编辑中的上下文编辑方法存在推理与知识纠缠的问题,DecKER通过解耦推理路径并采用混合验证,有效缓解知识冲突,在多跳任务中显著提升了推理一致性和准确性。
English: Knowledge editing in LLMs, particularly in-context editing (ICE), faces challenges from entangled reasoning and knowledge, which DecKER addresses by decoupling reasoning paths and using hybrid validation to enhance consistency and accuracy in multi-hop tasks.

Authors:Tianhui Liu, Jie Feng, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Yong Li
Title: CityLens: Benchmarking Large Language-Vision Models for Urban Socioeconomic Sensing
Abstract:
Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of large language-vision models (LLVMs) in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize three evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LLVMs across these tasks. Our results reveal that while LLVMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LLVMs to understand and predict urban socioeconomic patterns. Our code and datasets are open-sourced at https://github.com/tsinghua-fib-lab/CityLens.
中文摘要:CityLens是一个通过卫星和街景图像评估大语言视觉模型预测城市社会经济指标能力的综合基准,既揭示了现有模型的局限性,也为未来发展提供了统一框架。
English Summary: CityLens is a comprehensive benchmark that evaluates large language-vision models' ability to predict urban socioeconomic indicators from visual data, revealing their current limitations while providing a framework for future improvements.

Authors:Runtao Ren, Jian Ma, Jianxi Luo
Title: Retrieval-Augmented Generation Systems for Intellectual Property via Synthetic Multi-Angle Fine-tuning
Abstract:
Retrieval-Augmented Generation (RAG) systems in the Intellectual Property (IP) field often struggle with diverse user queries, including colloquial expressions, spelling errors, and ambiguous terminology, leading to inaccurate retrieval and suboptimal responses. To address this challenge, we propose the Multi-Angle Question Generation and Retrieval Fine-Tuning Method (MQG-RFM), a novel framework that leverages large language models (LLMs) to simulate varied user inquiries and fine-tunes retrieval models to align semantically equivalent but linguistically diverse questions. Unlike complex architectural modifications, MQG-RFM adopts a lightweight Data-to-Tune paradigm, combining prompt-engineered query generation with hard negative mining to enhance retrieval robustness without costly infrastructure changes. Experimental results on a Taiwan patent Q&A dataset show a 185.62% improvement in retrieval accuracy on the Patent Consultation dataset and a 262.26% improvement on the Novel Patent Technology Report dataset, with 14.22% and 53.58% improvements in generation quality over the baselines, respectively. By bridging the gap between user intent and system comprehension through semantic-aware retrieval optimization, MQG-RFM offers a practical, scalable approach for rapid, cost-effective deployment among small and medium-sized agencies seeking reliable patent intelligence solutions. Additionally, our proposed method has already been adopted by ScholarMate, the largest professional research social networking platform in China, to support real-world development and deployment. A demo version of the instantiated system is available at https://github.com/renruntao/patent_rag.
Chinese: MQG-RFM框架通过利用大语言模型模拟多样化用户查询并微调检索模型,显著提升了知识产权领域检索增强生成的准确性和生成质量,无需复杂架构改动即可实现高效部署。
English: The MQG-RFM framework enhances retrieval-augmented generation in intellectual property by using large language models to simulate diverse user queries and fine-tune retrieval models, achieving significant improvements in accuracy and generation quality without complex architectural changes.

Authors:Yuxi Sun, Aoqi Zuo, Wei Gao, Jing Ma
Title: CausalAbstain: Enhancing Multilingual LLMs with Causal Reasoning for Trustworthy Abstention
Abstract:
Large Language Models (LLMs) often exhibit knowledge disparities across languages. Encouraging LLMs to abstain when faced with knowledge gaps is a promising strategy to reduce hallucinations in multilingual settings. Current abstention strategies for multilingual scenarios primarily rely on generating feedback in various languages using LLMs and performing self-reflection. However, these methods can be adversely impacted by inaccuracies and biases in the generated feedback. To address this, from a causal perspective, we introduce CausalAbstain, a method that helps LLMs determine whether to utilize multiple generated feedback responses and how to identify the most useful ones. Extensive experiments demonstrate that CausalAbstain effectively selects helpful feedback and enhances abstention decisions with interpretability in both native language (Causal-native) and multilingual (Causal-multi) settings, outperforming strong baselines on two benchmark datasets covering encyclopedic and commonsense knowledge QA tasks. Our code and data are open-sourced at https://github.com/peachch/CausalAbstain.
Chinese: 为解决多语言大语言模型中的知识差异导致的幻觉问题,CausalAbstain方法通过因果分析筛选有效反馈以优化弃权决策,在单语和多语言场景下均展现出优于基准方法的性能表现。
English: To mitigate hallucinations caused by knowledge gaps in multilingual Large Language Models, the proposed CausalAbstain method employs causal analysis to select useful feedback for improving abstention decisions, demonstrating superior performance in both native and multilingual settings.

Authors:Zherui Li, Yan Mi, Zhenhong Zhou, Houcheng Jiang, Guibin Zhang, Kun Wang, Junfeng Fang
Title: Goal-Aware Identification and Rectification of Misinformation in Multi-Agent Systems
Abstract:
Large Language Model-based Multi-Agent Systems (MASs) have demonstrated strong advantages in addressing complex real-world tasks. However, due to the introduction of additional attack surfaces, MASs are particularly vulnerable to misinformation injection. To facilitate a deeper understanding of misinformation propagation dynamics within these systems, we introduce MisinfoTask, a novel dataset featuring complex, realistic tasks designed to evaluate MAS robustness against such threats. Building upon this, we propose ARGUS, a two-stage, training-free defense framework leveraging goal-aware reasoning for precise misinformation rectification within information flows. Our experiments demonstrate that in challenging misinformation scenarios, ARGUS exhibits significant efficacy across various injection attacks, achieving an average reduction in misinformation toxicity of approximately 28.17% and improving task success rates under attack by approximately 10.33%. Our code and dataset are available at: https://github.com/zhrli324/ARGUS.
中文:基于大语言模型的多智能体系统易受虚假信息攻击,而提出的ARGUS防御框架能显著降低28.17%的毒性并提升10.33%的任务成功率。
English: Large Language Model-based Multi-Agent Systems are vulnerable to misinformation, but the proposed ARGUS defense framework effectively reduces toxicity by 28.17% and improves task success rates by 10.33%.

Authors:Dohyun Lee, Seungil Chad Lee, Chanwoo Yang, Yujin Baek, Jaegul Choo
Title: Exploring In-context Example Generation for Machine Translation
Abstract:
Large language models (LLMs) have demonstrated strong performance across various tasks, leveraging their exceptional in-context learning ability with only a few examples. Accordingly, the selection of optimal in-context examples has been actively studied in the field of machine translation. However, these studies presuppose the presence of a demonstration pool with human-annotated pairs, making them less applicable to low-resource languages where such an assumption is challenging to meet. To overcome this limitation, this paper explores the research direction of in-context example generation for machine translation. Specifically, we propose Demonstration Augmentation for Translation (DAT), a simple yet effective approach that generates example pairs without relying on any external resources. This method builds upon two prior criteria, relevance and diversity, which have been highlighted in previous work as key factors for in-context example selection. Through experiments and analysis on low-resource languages where human-annotated pairs are scarce, we show that DAT achieves superior translation quality compared to the baselines. Furthermore, we investigate the potential of progressively accumulating generated pairs during test time to build and reuse a demonstration pool. Our implementation is publicly available at https://github.com/aiclaudev/DAT.
中文: 大语言模型在机器翻译中通过上下文学习表现出色,但其依赖人工标注示例对的特性限制了在低资源语言中的应用,因此本研究提出DAT方法,无需外部资源即可生成相关且多样的示例对,从而提升翻译质量。
English: Large language models excel in machine translation through in-context learning, but their reliance on human-annotated example pairs limits their use for low-resource languages, prompting this study to propose DAT, a method that generates relevant and diverse examples without external resources to enhance translation quality.

Authors:Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, Yohan Jo
Title: PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings
Abstract:
Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.
中文摘要:本研究发布个性化视觉说服(PVP)数据集,填补图像说服力与评估者个人信息关联数据的空白,实验证明结合心理特征能有效提升人工智能生成说服性图像的效果。
English Summary: The study introduces the Personalized Visual Persuasion (PVP) dataset to address the lack of comprehensive data linking image persuasiveness with evaluators' personal information, demonstrating that incorporating psychological traits improves AI-generated persuasive images.

Authors:Ioan-Paul Ciobanu, Andrei-Iulian Hiji, Nicolae-Catalin Ristea, Paul Irofti, Cristian Rusu, Radu Tudor Ionescu
Title: XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark
Abstract:
Recent advances in audio generation have led to an increasing number of deepfakes, making the general public more vulnerable to financial scams, identity theft, and misinformation. Audio deepfake detectors promise to alleviate this issue, with many recent studies reporting accuracy rates close to 99%. However, these methods are typically tested in an in-domain setup, where the deepfake samples from the training and test sets are produced by the same generative models. To this end, we introduce XMAD-Bench, a large-scale cross-domain multilingual audio deepfake benchmark comprising 668.8 hours of real and deepfake speech. In our novel dataset, the speakers, the generative methods, and the real audio sources are distinct across training and test splits. This leads to a challenging cross-domain evaluation setup, where audio deepfake detectors can be tested "in the wild". Our in-domain and cross-domain experiments indicate a clear disparity between the in-domain performance of deepfake detectors, which is usually as high as 100%, and the cross-domain performance of the same models, which is sometimes similar to random chance. Our benchmark highlights the need for the development of robust audio deepfake detectors, which maintain their generalization capacity across different languages, speakers, generative methods, and data sources. Our benchmark is publicly released at https://github.com/ristea/xmad-bench/.
中文:当前音频伪造检测器在受控环境中准确率接近完美,但在跨领域场景下表现不佳,凸显了开发能跨语言、说话人和生成方法泛化的鲁棒模型的必要性。
English: Recent audio deepfake detectors achieve near-perfect accuracy in controlled settings but perform poorly in cross-domain scenarios, highlighting the need for more robust models that generalize across languages, speakers, and generation methods.

Authors:Suhas BN, Han-Chin Shing, Lei Xu, Mitch Strong, Jon Burnsky, Jessica Ofor, Jordan R. Mason, Susan Chen, Sundararajan Srinivasan, Chaitanya Shivade, Jack Moriarty, Joseph Paul Cohen
Title: Fact-Controlled Diagnosis of Hallucinations in Medical Text Summarization
Abstract:
Hallucinations in large language models (LLMs) during summarization of patient-clinician dialogues pose significant risks to patient care and clinical decision-making. However, the phenomenon remains understudied in the clinical domain, with uncertainty surrounding the applicability of general-domain hallucination detectors. The rarity and randomness of hallucinations further complicate their investigation. In this paper, we conduct an evaluation of hallucination detection methods in the medical domain, and construct two datasets for this purpose: a fact-controlled Leave-N-out dataset, generated by systematically removing facts from source dialogues to induce hallucinated content in summaries, and a natural hallucination dataset, arising organically during LLM-based medical summarization. We show that general-domain detectors struggle to detect clinical hallucinations, and that performance on fact-controlled hallucinations does not reliably predict effectiveness on natural hallucinations. We then develop fact-based approaches that count hallucinations, offering explainability not available with existing methods. Notably, our LLM-based detectors, which we developed using fact-controlled hallucinations, generalize well to detecting real-world clinical hallucinations. This research contributes a suite of specialized metrics supported by expert-annotated datasets to advance faithful clinical summarization systems.
中文: 大语言模型在临床对话摘要中易产生幻觉,通用检测器效果不佳,为此开发了基于事实的专业化可解释检测方法,能有效识别真实医疗场景中的错误信息。
English: Large language models face challenges with hallucinations in clinical dialogue summarization, where general detectors prove inadequate, prompting the development of specialized, explainable metrics that effectively identify real-world medical inaccuracies.

Authors:Siavash Shams, Richard Antonello, Gavin Mischler, Stephan Bickel, Ashesh Mehta, Nima Mesgarani
Title: Neuro2Semantic: A Transfer Learning Framework for Semantic Reconstruction of Continuous Language from Human Intracranial EEG
Abstract:
Decoding continuous language from neural signals remains a significant challenge in the intersection of neuroscience and artificial intelligence. We introduce Neuro2Semantic, a novel framework that reconstructs the semantic content of perceived speech from intracranial EEG (iEEG) recordings. Our approach consists of two phases: first, an LSTM-based adapter aligns neural signals with pre-trained text embeddings; second, a corrector module generates continuous, natural text directly from these aligned embeddings. This flexible method overcomes the limitations of previous decoding approaches and enables unconstrained text generation. Neuro2Semantic achieves strong performance with as little as 30 minutes of neural data, outperforming a recent state-of-the-art method in low-data settings. These results highlight the potential for practical applications in brain-computer interfaces and neural decoding technologies.
Chinese: Neuro2Semantic是一种创新框架,通过两阶段方法从颅内脑电图记录中重建感知语音的语义内容,实现了无约束的文本生成,并在少量神经数据下展现出优异性能。
English: Neuro2Semantic is an innovative framework that reconstructs the semantic content of perceived speech from intracranial EEG recordings using a two-phase approach, enabling unconstrained text generation and achieving strong performance with minimal neural data.

Authors:Yubai Wei, Jiale Han, Yi Yang
Title: Adapting General-Purpose Embedding Models to Private Datasets Using Keyword-based Retrieval
Abstract:
Text embedding models play a cornerstone role in AI applications, such as retrieval-augmented generation (RAG). While general-purpose text embedding models demonstrate strong performance on generic retrieval benchmarks, their effectiveness diminishes when applied to private datasets (e.g., company-specific proprietary data), which often contain specialized terminology and lingo. In this work, we introduce BMEmbed, a novel method for adapting general-purpose text embedding models to private datasets. By leveraging the well-established keyword-based retrieval technique (BM25), we construct supervisory signals from the ranking of keyword-based retrieval results to facilitate model adaptation. We evaluate BMEmbed across a range of domains, datasets, and models, showing consistent improvements in retrieval performance. Moreover, we provide empirical insights into how BM25-based signals contribute to improving embeddings by fostering alignment and uniformity, highlighting the value of this approach in adapting models to domain-specific data. We release the source code available at https://github.com/BaileyWei/BMEmbed for the research community.
中文摘要:BMEmbed通过利用BM25检索技术构建监督信号,有效提升通用文本嵌入模型在私有数据集上的检索性能,实现了跨领域的一致改进。
English Summary: BMEmbed enhances general-purpose text embedding models for private datasets by using BM25-based retrieval signals to improve domain-specific performance, consistently boosting retrieval effectiveness across various domains.
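
As a concrete illustration of the adaptation signal, the sketch below builds (query, positive, negative) training triplets from BM25 rankings over a toy private corpus. It assumes the third-party rank_bm25 package; the corpus, the make_triplets helper, and the top/bottom selection heuristic are illustrative stand-ins, and the actual BMEmbed training objective may differ.

```python
from rank_bm25 import BM25Okapi

corpus = [
    "quarterly revenue recognition policy for SaaS contracts",
    "employee onboarding checklist and IT provisioning",
    "revenue forecast model assumptions for FY25",
    "office seating chart and parking assignments",
]
tokenized = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def make_triplets(query: str, k_pos: int = 1, k_neg: int = 1):
    """Build (query, positive, negative) pairs from BM25 rankings.

    Top-ranked documents act as pseudo-positives and bottom-ranked ones
    as negatives, yielding supervision without human labels.
    """
    scores = bm25.get_scores(query.lower().split())
    order = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    positives = [corpus[i] for i in order[:k_pos]]
    negatives = [corpus[i] for i in order[-k_neg:]]
    return [(query, p, n) for p in positives for n in negatives]

print(make_triplets("how do we recognize SaaS revenue"))
```

Such triplets can then feed any standard contrastive fine-tuning loop for the embedding model.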

Authors:Dang Nguyen, Ali Payani, Baharan Mirzasoleiman
Title: Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity
Abstract:
Hallucination in large language models (LLMs) can be detected by assessing the uncertainty of model outputs, typically measured using entropy. Semantic entropy (SE) enhances traditional entropy estimation by quantifying uncertainty at the semantic cluster level. However, as modern LLMs generate longer one-sentence responses, SE becomes less effective because it overlooks two crucial factors: intra-cluster similarity (the spread within a cluster) and inter-cluster similarity (the distance between clusters). To address these limitations, we propose a simple black-box uncertainty quantification method inspired by nearest neighbor estimates of entropy. Our approach can also be easily extended to white-box settings by incorporating token probabilities. Additionally, we provide theoretical results showing that our method generalizes semantic entropy. Extensive empirical results demonstrate its effectiveness compared to semantic entropy across two recent LLMs (Phi3 and Llama3) and three common text generation tasks: question answering, text summarization, and machine translation. Our code is available at https://github.com/BigML-CS-UCLA/SNNE.
中文摘要:本文提出了一种新的黑盒不确定性量化方法,通过考虑簇内和簇间相似性改进了语义熵,在多种大语言模型和文本生成任务中展现出更优性能。
English Summary: This paper introduces a new black-box uncertainty quantification method that improves upon semantic entropy by accounting for intra-cluster and inter-cluster similarities, demonstrating superior performance across multiple LLMs and text generation tasks.
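
To make the nearest-neighbor idea concrete, here is a minimal black-box sketch: sample several answers to the same question, embed them with any sentence encoder, and score uncertainty by a Kozachenko-Leonenko-style average log nearest-neighbor distance. The constant terms of the estimator are dropped, and the function name and random embeddings are illustrative; the paper's SNNE objective and its white-box extension differ in detail.

```python
import numpy as np

def nn_entropy_score(embeddings: np.ndarray) -> float:
    """Uncertainty from nearest-neighbor distances between sampled answers.

    Tightly clustered samples (the model keeps saying the same thing)
    give a low score; semantically spread-out samples give a high one.
    Dimension-dependent constants of the estimator are omitted.
    """
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)   # ignore self-distances
    nearest = dists.min(axis=1)       # distance to the closest other sample
    return float(np.mean(np.log(nearest + 1e-12)))

rng = np.random.default_rng(0)
confident = rng.normal(size=(1, 384)).repeat(8, axis=0) \
    + 0.01 * rng.normal(size=(8, 384))     # 8 near-identical answers
uncertain = rng.normal(size=(8, 384))      # 8 unrelated answers
print(nn_entropy_score(confident) < nn_entropy_score(uncertain))  # True
```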

Authors:Linyuan Gong, Alvin Cheung, Mostafa Elhoushi, Sida Wang
Title: Structure-Aware Fill-in-the-Middle Pretraining for Code
Abstract:
Fill-in-the-Middle (FIM) is a common pretraining method for code LLMs, where models complete code segments given surrounding context. However, existing LLMs treat code as plain text and mask random character spans. We propose and evaluate AST-FIM, a pretraining strategy that leverages Abstract Syntax Trees (ASTs) to mask complete syntactic structures at scale, ensuring coherent training examples better aligned with universal code structures and common code editing patterns such as blocks, expressions, or functions. To evaluate real-world fill-in-the-middle (FIM) programming tasks, we introduce Real-FIM-Eval, a benchmark derived from 30,000+ GitHub commits across 12 languages. On infilling tasks, experiments on 1B and 8B parameter models show that AST-FIM is particularly beneficial for real-world code editing as it outperforms standard random-character FIM by up to 5 pts on standard FIM benchmarks. Our code is publicly available at https://github.com/gonglinyuan/ast_fim.
中文: AST-FIM是一种利用抽象语法树掩码完整语法结构的新型预训练方法,在代码填充任务中比标准随机字符掩码性能提升高达5个百分点,更贴合实际代码编辑模式。
English: AST-FIM is a novel pretraining method that uses Abstract Syntax Trees to mask complete syntactic structures, outperforming standard random-character masking by up to 5 points on fill-in-the-middle benchmarks and better aligning with real-world code editing patterns.
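
The data-construction idea can be sketched with Python's own ast module: pick a complete syntactic node and split the file into prefix / middle / suffix around it, rather than masking a random character span. The node types chosen and the function name are illustrative assumptions; the released AST-FIM pipeline is multi-language and considerably more involved.

```python
import ast
import random

def ast_fim_example(source: str, seed: int = 0):
    """Mask one complete syntactic unit (a function, loop, or if-block)
    instead of a random character span."""
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.For, ast.If, ast.While))]
    node = random.Random(seed).choice(nodes)
    lines = source.splitlines(keepends=True)
    start, end = node.lineno - 1, node.end_lineno  # 1-indexed, inclusive
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    # Standard FIM ordering: given <prefix> and <suffix>, predict <middle>.
    return prefix, middle, suffix

code = '''\
def f(xs):
    total = 0
    for x in xs:
        total += x
    return total
'''
prefix, middle, suffix = ast_fim_example(code)
print("MIDDLE to infill:\n" + middle)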

Authors:Yaxin Luo, Zhaoyi Li, Jiacheng Liu, Jiacheng Cui, Xiaohan Zhao, Zhiqiang Shen
Title: Open CaptchaWorld: A Comprehensive Web-based Platform for Testing and Benchmarking Multimodal LLM Agents
Abstract:
CAPTCHAs have been a critical bottleneck for deploying web agents in real-world applications, often blocking them from completing end-to-end automation tasks. While modern multimodal LLM agents have demonstrated impressive performance in static perception tasks, their ability to handle interactive, multi-step reasoning challenges like CAPTCHAs is largely untested. To address this gap, we introduce Open CaptchaWorld, the first web-based benchmark and platform specifically designed to evaluate the visual reasoning and interaction capabilities of MLLM-powered agents through diverse and dynamic CAPTCHA puzzles. Our benchmark spans 20 modern CAPTCHA types, totaling 225 CAPTCHAs, annotated with a new metric we propose: CAPTCHA Reasoning Depth, which quantifies the number of cognitive and motor steps required to solve each puzzle. Experimental results show that while humans consistently achieve near-perfect scores, state-of-the-art MLLM agents struggle significantly, with success rates of at most 40.0% (by Browser-Use with OpenAI o3), far below the human-level performance of 93.3%. This highlights Open CaptchaWorld as a vital benchmark for diagnosing the limits of current multimodal agents and guiding the development of more robust multimodal reasoning systems. Code and Data are available at this https URL.
Chinese Summary: Open CaptchaWorld 是首个基于网络的基准测试平台,用于评估多模态大语言模型代理解决各类验证码的能力,实验显示人类表现接近完美,而顶尖模型成功率远低于人类水平,揭示了当前交互推理能力的重大不足。
English Summary: Open CaptchaWorld is introduced as the first web-based benchmark to test multimodal LLM agents' capabilities in solving diverse CAPTCHA puzzles, revealing that while humans achieve near-perfect scores, current state-of-the-art agents significantly underperform, highlighting critical gaps in interactive reasoning.

Authors:Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, Philip Torr, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan
Title: Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks
Abstract:
Deep reasoning is fundamental for solving complex tasks, especially in vision-centric scenarios that demand sequential, multimodal understanding. However, existing benchmarks typically evaluate agents with fully synthetic, single-turn queries, limited visual modalities, and lack a framework to assess reasoning quality over multiple steps as required in real-world settings. To address this, we introduce Agent-X, a large-scale benchmark for evaluating vision-centric agents' multi-step and deep reasoning capabilities in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. These tasks span six major agentic environments: general visual reasoning, web browsing, security and surveillance, autonomous driving, sports, and math reasoning. Our benchmark requires agents to integrate tool use with explicit, stepwise decision-making in these diverse settings. In addition, we propose a fine-grained, step-level evaluation framework that assesses the correctness and logical coherence of each reasoning step and the effectiveness of tool usage throughout the task. Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks, achieving less than 50% full-chain success. These findings highlight key bottlenecks in current LMM reasoning and tool-use capabilities and identify future research directions in vision-centric agentic reasoning models. Our data and code are publicly available at https://github.com/mbzuai-oryx/Agent-X
中文摘要:我们推出了Agent-X基准测试,用于评估视觉中心智能体在真实多模态环境中的多步推理和工具使用能力,结果表明即使GPT、Gemini等顶尖模型在复杂任务中表现不佳,完整任务成功率不足50%。
English Summary: The Agent-X benchmark is introduced to evaluate vision-centric agents' multi-step reasoning and tool-use capabilities in real-world multimodal settings, revealing that top models like GPT and Gemini struggle with complex tasks, achieving under 50% success rates.

Authors:Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, Vicente Ordonez
Title: ProxyThinker: Test-Time Guidance through Small Visual Reasoners
Abstract:
Recent advancements in reinforcement learning with verifiable rewards have pushed the boundaries of the visual reasoning capabilities in large vision-language models (LVLMs). However, training LVLMs with reinforcement fine-tuning (RFT) is computationally expensive, posing a significant challenge to scaling model size. In this work, we propose ProxyThinker, an inference-time technique that enables large models to inherit the visual reasoning capabilities from small, slow-thinking visual reasoners without any training. By subtracting the output distributions of base models from those of RFT reasoners, ProxyThinker modifies the decoding dynamics and successfully elicits slow-thinking reasoning, as demonstrated by emergent sophisticated behaviors such as self-verification and self-correction. ProxyThinker consistently boosts performance on challenging visual benchmarks spanning spatial, mathematical, and multi-disciplinary reasoning, enabling untuned base models to compete with the performance of their full-scale RFT counterparts. Furthermore, our implementation efficiently coordinates multiple language models with parallelism techniques and achieves up to 38 $\times$ faster inference compared to previous decoding-time methods, paving the way for the practical deployment of ProxyThinker. Code is available at https://github.com/MrZilinXiao/ProxyThinker.
中文: ProxyThinker是一种无需训练的推理时技术,通过调整解码动态使大型视觉语言模型能够继承小型慢思考推理器的视觉推理能力,在多项复杂视觉基准测试中显著提升性能,同时实现高达38倍的推理加速。
English: ProxyThinker is an inference-time technique that enables large vision-language models to acquire enhanced visual reasoning capabilities from smaller, specialized reasoners without additional training, significantly improving performance on complex visual benchmarks while accelerating inference speeds by up to 38 times.
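
At its core this is logit arithmetic at each decoding step. The sketch below shows one step under the assumption that guidance is applied as large-base plus alpha times the (small-RFT minus small-base) delta; the coefficient alpha, the function name, and the greedy pick are illustrative, and the paper's exact combination rule and parallel multi-model serving are more sophisticated.

```python
import numpy as np

def guided_logits(logits_large_base: np.ndarray,
                  logits_small_rft: np.ndarray,
                  logits_small_base: np.ndarray,
                  alpha: float = 1.0) -> np.ndarray:
    """One decoding step of training-free proxy guidance.

    The small RFT reasoner's delta over its own base model nudges the
    large untuned model toward slow-thinking behavior.
    """
    return logits_large_base + alpha * (logits_small_rft - logits_small_base)

rng = np.random.default_rng(1)
vocab = 8  # toy vocabulary size
step = guided_logits(rng.normal(size=vocab),
                     rng.normal(size=vocab),
                     rng.normal(size=vocab))
next_token = int(np.argmax(step))  # greedy pick, for illustration only
print(next_token)
```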

Authors:Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi
Title: Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
Abstract:
Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practices employ rejection sampling, discarding incorrect reasoning examples -- valuable, yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary data) across various mathematical reasoning benchmarks, establishing a new state-of-the-art for 1.5B models post-trained offline with openly available data.
中文: 本文提出强化蒸馏(REDI)框架,通过两阶段学习同时利用正负推理轨迹来增强模型推理能力,在数学任务上使用公开数据训练的1.5B模型取得了最优性能。
English: This paper introduces Reinforcement Distillation (REDI), a two-stage framework that leverages both positive and negative reasoning traces to enhance model reasoning, achieving state-of-the-art performance for 1.5B models on mathematical tasks with openly available data.
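
A minimal sketch of the stage-2 idea, assuming per-trace mean token log-probabilities are already computed for correct (positive) and incorrect (negative) teacher traces. The asymmetric weight beta and the exact loss form below are assumptions for illustration; the paper's reference-free REDI objective differs in detail.

```python
import torch

def redi_style_loss(logp_pos: torch.Tensor, logp_neg: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Offline loss over distilled traces: push up the likelihood of
    positive traces and, with a smaller weight, push down negatives so
    the noisier negative data does not dominate the gradient."""
    return -logp_pos.mean() + beta * logp_neg.mean()

# Stand-ins for student log-probs on a batch of positive/negative traces.
logp_pos = torch.randn(4, requires_grad=True)
logp_neg = torch.randn(4, requires_grad=True)
loss = redi_style_loss(logp_pos, logp_neg)
loss.backward()  # gradients flow back to whatever produced the log-probs
print(float(loss))
```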

Authors:Wanyun Xie, Francesco Tonin, Volkan Cevher
Title: Chameleon: A Flexible Data-mixing Framework for Language Model Pretraining and Finetuning
Abstract:
Training data mixtures greatly impact the generalization performance of large language models. Existing domain reweighting methods often rely on costly weight computations and require retraining when new data is introduced. To this end, we introduce a flexible and efficient data mixing framework, Chameleon, that employs leverage scores to quantify domain importance within a learned embedding space. We first construct a domain affinity matrix over domain embeddings. The induced leverage scores determine a mixture that upweights domains sharing common representations in embedding space. This formulation allows direct transfer to new data by computing the new domain embeddings. In experiments, we demonstrate improvements over three key scenarios: (i) our computed weights improve performance on pretraining domains with a fraction of the compute of existing methods; (ii) Chameleon can adapt to data changes without proxy retraining, boosting few-shot reasoning accuracies when transferred to new data; (iii) our method enables efficient domain reweighting in finetuning, consistently improving test perplexity on all finetuning domains over uniform mixture. Our code is available at https://github.com/LIONS-EPFL/Chameleon.
Chinese: Chameleon框架通过利用嵌入空间中的杠杆分数动态调整训练领域权重,无需昂贵重训练即可适应新数据,在预训练、迁移学习和微调场景中持续提升语言模型性能。
English: The Chameleon framework efficiently improves language model performance by using leverage scores to dynamically reweight training domains in embedding space, enabling seamless adaptation to new data without costly retraining across pretraining, transfer learning, and fine-tuning scenarios.
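
The sketch below computes one standard notion of leverage scores (the diagonal of the hat matrix over domain embeddings) and normalizes them into mixture weights; adding a new domain then only requires embedding it and recomputing the diagonal, with no proxy retraining. The random embeddings are stand-ins, and Chameleon's affinity-matrix construction may differ from this plain least-squares form.

```python
import numpy as np

def domain_mixture_weights(domain_embeddings: np.ndarray) -> np.ndarray:
    """Mixture weights from statistical leverage scores.

    Leverage score of domain i = the i-th diagonal entry of the hat
    matrix X (X^T X)^+ X^T, computed over learned domain embeddings.
    """
    X = domain_embeddings
    hat = X @ np.linalg.pinv(X.T @ X) @ X.T   # pinv handles rank deficiency
    scores = np.clip(np.diag(hat), 0.0, None)
    return scores / scores.sum()

rng = np.random.default_rng(0)
domains = rng.normal(size=(6, 3))             # 6 domains, 3-d embeddings
weights = domain_mixture_weights(domains)
print(weights.round(3), weights.sum())        # a valid mixture, sums to 1.0
```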

Authors:Li Yunhan, Wu Gengshen
Title: LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
Abstract:
As large language models (LLMs) are increasingly used in legal applications, current evaluation benchmarks tend to focus mainly on factual accuracy while largely neglecting important linguistic quality aspects such as clarity, coherence, and terminology. To address this gap, we propose three steps: First, we develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology. Second, we create a specialized set of legal questions. Third, we analyze 49 LLMs using this evaluation framework. Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7\%$ noted at 72 billion parameters. Second, engineering choices such as quantization and context length have a negligible impact, as indicated by statistical significance thresholds above 0.016. Third, reasoning models consistently outperform base architectures. A significant outcome of our research is the release of a ranking list and Pareto analysis, which highlight the Qwen3 series as the optimal choice for cost-performance tradeoffs. This work not only establishes standardized evaluation protocols for legal LLMs but also uncovers fundamental limitations in current training data refinement approaches. Code and models are available at: https://github.com/lyxx3rd/LegalEval-Q.
中文: 本研究通过开发回归模型、构建专业法律问题集并分析49个大型语言模型,填补了法律大模型语言质量评估的空白,发现模型性能在140亿参数时趋于稳定,推理模型优于基础架构,且Qwen3系列在成本效益方面表现最佳。
English: This study addresses the gap in evaluating linguistic quality of legal LLMs by developing a regression model, specialized legal questions, and analyzing 49 models, revealing that performance plateaus at 14B parameters and reasoning models outperform base architectures, with the Qwen3 series identified as optimal for cost-performance balance.

Authors:Yucheng Zhou, Jiahao Yuan, Qianning Wang
Title: Draw ALL Your Imagine: A Holistic Benchmark and Agent Framework for Complex Instruction-based Image Generation
Abstract:
Recent advancements in text-to-image (T2I) generation have enabled models to produce high-quality images from textual descriptions. However, these models often struggle with complex instructions involving multiple objects, attributes, and spatial relationships. Existing benchmarks for evaluating T2I models primarily focus on general text-image alignment and fail to capture the nuanced requirements of complex, multi-faceted prompts. Given this gap, we introduce LongBench-T2I, a comprehensive benchmark specifically designed to evaluate T2I models under complex instructions. LongBench-T2I consists of 500 intricately designed prompts spanning nine diverse visual evaluation dimensions, enabling a thorough assessment of a model's ability to follow complex instructions. Beyond benchmarking, we propose an agent framework (Plan2Gen) that facilitates complex instruction-driven image generation without requiring additional model training. This framework integrates seamlessly with existing T2I models, using large language models to interpret and decompose complex prompts, thereby guiding the generation process more effectively. As existing evaluation metrics, such as CLIPScore, fail to adequately capture the nuances of complex instructions, we introduce an evaluation toolkit that automates the quality assessment of generated images using a set of multi-dimensional metrics. The data and code are released at https://github.com/yczhou001/LongBench-T2I.
中文: 针对文本到图像模型处理复杂指令的不足,我们推出了LongBench-T2I基准测试和Plan2Gen代理框架,无需额外训练即可提升生成与评估能力。
English: Recent text-to-image models face challenges with complex prompts, prompting the introduction of LongBench-T2I, a comprehensive benchmark and Plan2Gen agent framework to enhance generation and evaluation without additional training.

Authors:Jiayu Liu, Qing Zong, Weiqi Wang, Yangqiu Song
Title: Revisiting Epistemic Markers in Confidence Estimation: Can Markers Accurately Reflect Large Language Models' Uncertainty?
Abstract:
As large language models (LLMs) are increasingly used in high-stakes domains, accurately assessing their confidence is crucial. Humans typically express confidence through epistemic markers (e.g., "fairly confident") instead of numerical values. However, it remains unclear whether LLMs consistently use these markers to reflect their intrinsic confidence due to the difficulty of quantifying uncertainty associated with various markers. To address this gap, we first define marker confidence as the observed accuracy when a model employs an epistemic marker. We evaluate its stability across multiple question-answering datasets in both in-distribution and out-of-distribution settings for open-source and proprietary LLMs. Our results show that while markers generalize well within the same distribution, their confidence is inconsistent in out-of-distribution scenarios. These findings raise significant concerns about the reliability of epistemic markers for confidence estimation, underscoring the need for improved alignment between marker-based confidence and actual model uncertainty. Our code is available at https://github.com/HKUST-KnowComp/MarCon.
中文摘要:本研究评估了认知标记词在反映大语言模型内在置信度方面的可靠性,发现标记词在同一数据分布内能保持稳定的准确性,但在分布外场景中表现出不稳定性,这对基于标记词的置信度评估的可信度提出了重要关切。
English Summary: This study evaluates the reliability of epistemic markers in reflecting large language models' intrinsic confidence, finding that while markers maintain consistent accuracy within the same data distribution, they exhibit instability in out-of-distribution scenarios, highlighting concerns about their trustworthiness for confidence estimation.
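
The paper's central quantity is easy to state in code: marker confidence is the observed accuracy over all answers in which the model used a given epistemic marker. A minimal tally, with the record format below as an assumption:

```python
from collections import defaultdict

def marker_confidence(records):
    """Marker confidence = observed accuracy when the model uses a marker.

    records: iterable of (epistemic_marker, is_correct) pairs collected
    from QA runs, e.g. ("fairly confident", True).
    """
    tally = defaultdict(lambda: [0, 0])  # marker -> [correct, total]
    for marker, correct in records:
        tally[marker][0] += int(correct)
        tally[marker][1] += 1
    return {m: c / t for m, (c, t) in tally.items()}

runs = [("fairly confident", True), ("fairly confident", True),
        ("fairly confident", False), ("not sure", False), ("not sure", True)]
print(marker_confidence(runs))  # {'fairly confident': ~0.67, 'not sure': 0.5}
```

Comparing these per-marker accuracies between in-distribution and out-of-distribution runs is what reveals the instability the paper reports.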

Authors:Zafir Stojanovski, Oliver Stanley, Joe Sharratt, Richard Jones, Abdulhakeem Adefioye, Jean Kaddour, Andreas Köpf
Title: REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards
Abstract:
We introduce Reasoning Gym (RG), a library of reasoning environments for reinforcement learning with verifiable rewards. It provides over 100 data generators and verifiers spanning multiple domains including algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and various common games. Its key innovation is the ability to generate virtually infinite training data with adjustable complexity, unlike most previous reasoning datasets, which are typically fixed. This procedural generation approach allows for continuous evaluation across varying difficulty levels. Our experimental results demonstrate the efficacy of RG in both evaluating and reinforcement learning of reasoning models.
中文: Reasoning Gym是一个强化学习推理环境库,通过100多个可验证奖励的领域生成器实现无限可调难度的训练数据,有效支持推理模型的评估与训练。
English: Reasoning Gym is a reinforcement learning library featuring over 100 procedurally generated environments with verifiable rewards across multiple domains, enabling infinite adjustable-difficulty training data for effective model evaluation and training.
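
In the spirit of RG's generator/verifier pairs, here is a toy environment with procedurally generated problems, adjustable difficulty, and a binary verifiable reward suitable for RL. The class and its interface are illustrative, not RG's actual API:

```python
import random

class ModularArithmeticEnv:
    """Toy procedurally generated task with a verifiable reward.

    Difficulty is the number of terms, so training data of any desired
    complexity can be generated on the fly instead of being fixed.
    """

    def __init__(self, difficulty: int = 3, seed=None):
        self.difficulty = difficulty
        self.rng = random.Random(seed)

    def generate(self):
        terms = [self.rng.randint(1, 99) for _ in range(self.difficulty)]
        question = " + ".join(map(str, terms)) + " mod 7 = ?"
        answer = sum(terms) % 7
        return question, answer

    def verify(self, answer: int, proposed: str) -> float:
        """Binary verifiable reward for reinforcement learning."""
        try:
            return 1.0 if int(proposed.strip()) == answer else 0.0
        except ValueError:
            return 0.0

env = ModularArithmeticEnv(difficulty=4, seed=42)
q, a = env.generate()
print(q, "| reward:", env.verify(a, str(a)))
```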

Authors:Yingchaojie Feng, Yiqun Sun, Yandong Sun, Minfeng Zhu, Qiang Huang, Anthony K. H. Tung, Wei Chen
Title: Don't Reinvent the Wheel: Efficient Instruction-Following Text Embedding based on Guided Space Transformation
Abstract:
In this work, we investigate an important task named instruction-following text embedding, which generates dynamic text embeddings that adapt to user instructions, highlighting specific attributes of text. Despite recent advancements, existing approaches suffer from significant computational overhead, as they require re-encoding the entire corpus for each new instruction. To address this challenge, we propose GSTransform, a novel instruction-following text embedding framework based on Guided Space Transformation. Our key observation is that instruction-relevant information is inherently encoded in generic embeddings but remains underutilized. Instead of repeatedly encoding the corpus for each instruction, GSTransform is a lightweight transformation mechanism that adapts pre-computed embeddings in real time to align with user instructions, guided by a small amount of text data with instruction-focused label annotation. We conduct extensive experiments on three instruction-awareness downstream tasks across nine real-world datasets, demonstrating that GSTransform improves instruction-following text embedding quality over state-of-the-art methods while achieving dramatic speedups of 6-300x in real-time processing on large-scale datasets. The source code is available at https://github.com/YingchaojieFeng/GSTransform.
中文: 本文提出GSTransform框架,通过引导空间变换将预计算文本嵌入动态适配用户指令,在提升嵌入质量的同时实现比现有方法快6-300倍的实时处理速度。
English: This paper introduces GSTransform, a lightweight framework that dynamically adapts pre-computed text embeddings to user instructions through guided space transformation, achieving superior embedding quality and 6-300x faster processing than existing methods.

Authors:Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, Yike Guo
Title: FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Abstract:
Multimodal Large Language Models (MLLMs) have experienced rapid development in recent years. However, in the financial domain, there is a notable lack of effective and specialized multimodal evaluation datasets. To advance the development of MLLMs in the finance domain, we introduce FinMME, encompassing more than 11,000 high-quality financial research samples across 18 financial domains and 6 asset classes, featuring 10 major chart types and 21 subtypes. We ensure data quality through 20 annotators and carefully designed validation mechanisms. Additionally, we develop FinScore, an evaluation system incorporating hallucination penalties and multi-dimensional capability assessment to provide an unbiased evaluation. Extensive experimental results demonstrate that even state-of-the-art models like GPT-4o exhibit unsatisfactory performance on FinMME, highlighting its challenging nature. The benchmark exhibits high robustness with prediction variations under different prompts remaining below 1%, demonstrating superior reliability compared to existing datasets. Our dataset and evaluation protocol are available at https://huggingface.co/datasets/luojunyu/FinMME and https://github.com/luo-junyu/FinMME.
中文: 针对多模态大语言模型在金融领域缺乏专业评估数据集的问题,我们开发了包含逾1.1万样本的FinMME数据集及FinScore评估体系,实验表明即使GPT-4o等顶尖模型在该高鲁棒性基准上也表现不佳。
English: Multimodal Large Language Models (MLLMs) lack specialized financial evaluation datasets, prompting the creation of FinMME with over 11,000 samples and FinScore for unbiased assessment, revealing even top models like GPT-4o struggle on this robust benchmark.

Authors:Sander Land, Catherine Arnett
Title: BPE Stays on SCRIPT: Structured Encoding for Robust Multilingual Pretokenization
Abstract:
Byte Pair Encoding (BPE) tokenizers, widely used in Large Language Models, face challenges in multilingual settings, including penalization of non-Western scripts and the creation of tokens with partial UTF-8 sequences. Pretokenization, often reliant on complex regular expressions, can also introduce fragility and unexpected edge cases. We propose SCRIPT (Script Category Representation in PreTokenization), a novel encoding scheme that bypasses UTF-8 byte conversion by using initial tokens based on Unicode script and category properties. This approach enables a simple, rule-based pretokenization strategy that respects script boundaries, offering a robust alternative to pretokenization strategies based on regular expressions. We also introduce and validate a constrained BPE merging strategy that enforces character integrity, applicable to both SCRIPT-BPE and byte-based BPE. Our experiments demonstrate that SCRIPT-BPE achieves competitive compression while eliminating encoding-based penalties for non-Latin-script languages.
中文: SCRIPT-BPE提出基于Unicode脚本类别的编码方案,通过遵循文字边界的预分词规则取代脆弱的正则表达式方法,并在BPE合并时保持字符完整性,在维持压缩率的同时消除了非拉丁文字的编码惩罚。
English: SCRIPT-BPE introduces a Unicode-based encoding scheme that replaces fragile regex pretokenization with script-boundary rules and enforces character integrity during BPE merging, eliminating encoding penalties for non-Latin scripts while maintaining competitive compression.
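
The boundary-respecting pretokenization can be illustrated with the third-party regex module, which understands Unicode script properties: split text into runs of a single script (plus digit runs and whitespace) so no token ever straddles a script boundary. SCRIPT itself assigns initial tokens from script and category properties rather than applying a regular expression; this sketch only shows the boundary behavior.

```python
import regex  # third-party 'regex' supports Unicode script properties

# Runs of a single script, digit runs, whitespace, then a catch-all.
SCRIPT_RUN = regex.compile(
    r"\p{Latin}+|\p{Han}+|\p{Cyrillic}+|\p{Arabic}+|\p{N}+|\s+|."
)

def script_pretokenize(text: str):
    """Pretokenize so that no unit mixes characters from two scripts."""
    return SCRIPT_RUN.findall(text)

print(script_pretokenize("GPT模型 работает well 123"))
# ['GPT', '模型', ' ', 'работает', ' ', 'well', ' ', '123']
```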

Authors:Qinglin Zhu, Runcong Zhao, Hanqi Yan, Yulan He, Yudong Chen, Lin Gui
Title: Soft Reasoning: Navigating Solution Spaces in Large Language Models through Controlled Embedding Exploration
Abstract:
Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose Soft Reasoning, an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution. The code is released at https://github.com/alickzhu/Soft-Reasoning.
中文摘要:Soft Reasoning是一种基于嵌入的搜索框架,通过优化初始标记嵌入并结合受控探索与贝叶斯优化,有效提升大语言模型在复杂推理中的准确性和可扩展性,同时显著降低计算成本。
English Summary: Soft Reasoning is an embedding-based search framework that enhances complex reasoning in LLMs by optimizing initial token embeddings through controlled exploration and Bayesian optimization, improving accuracy and scalability with minimal computation.

Authors:Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu
Title: PRISM: A Framework for Producing Interpretable Political Bias Embeddings with Political-Aware Cross-Encoder
Abstract:
Semantic Text Embedding is a fundamental NLP task that encodes textual content into vector representations, where proximity in the embedding space reflects semantic similarity. While existing embedding models excel at capturing general meaning, they often overlook ideological nuances, limiting their effectiveness in tasks that require an understanding of political bias. To address this gap, we introduce PRISM, the first framework designed to Produce inteRpretable polItical biaS eMbeddings. PRISM operates in two key stages: (1) Controversial Topic Bias Indicator Mining, which systematically extracts fine-grained political topics and their corresponding bias indicators from weakly labeled news data, and (2) Cross-Encoder Political Bias Embedding, which assigns structured bias scores to news articles based on their alignment with these indicators. This approach ensures that embeddings are explicitly tied to bias-revealing dimensions, enhancing both interpretability and predictive power. Through extensive experiments on two large-scale datasets, we demonstrate that PRISM outperforms state-of-the-art text embedding models in political bias classification while offering highly interpretable representations that facilitate diversified retrieval and ideological analysis. The source code is available at https://github.com/dukesun99/ACL-PRISM.
中文: PRISM是首个可生成可解释政治偏见嵌入的框架,通过从新闻数据中提取细粒度偏见指标并为文章分配结构化偏见分数,在政治偏见分类任务中优于现有模型,同时提供便于意识形态分析的透明表征。
English: PRISM is a novel framework that generates interpretable political bias embeddings by extracting bias indicators from news data and assigning structured bias scores, outperforming existing models in political bias classification while providing transparent representations for ideological analysis.

Authors:Xiaoang Xu, Shuo Wang, Xu Han, Zhenghao Liu, Huijia Wu, Peipei Li, Zhiyuan Liu, Maosong Sun, Zhaofeng He
Title: A*-Thought: Efficient Reasoning via Bidirectional Compression for Low-Resource Settings
Abstract:
Large Reasoning Models (LRMs) achieve superior performance by extending the thought length. However, a lengthy thinking trajectory leads to reduced efficiency. Most existing methods assume overthinking and attempt to reason efficiently by compressing the Chain-of-Thought, but this often leads to performance degradation. To address this problem, we introduce A*-Thought, an efficient tree search-based unified framework designed to identify and isolate the most essential thoughts from the extensive reasoning chains produced by these models. It formulates the reasoning process of LRMs as a search tree, where each node represents a reasoning span in the giant reasoning space. By combining the A* search algorithm with a cost function specific to the reasoning path, it can efficiently compress the chain of thought and determine a reasoning path with high information density and low cost. In addition, we also propose a bidirectional importance estimation mechanism, which further refines this search process and enhances its efficiency beyond uniform sampling. Extensive experiments on several advanced math tasks show that A*-Thought effectively balances performance and efficiency over a huge search space. Specifically, A*-Thought can improve the performance of QwQ-32B by 2.39$\times$ under a low budget and reduce output token length by nearly 50% under a high budget. The proposed method is also compatible with several other LRMs, demonstrating its generalization capability. The code can be accessed at: https://github.com/AI9Stars/AStar-Thought.
中文: A*-Thought 是一种高效的树搜索框架,通过识别大型推理模型中的关键思路并利用双向估计和A*搜索压缩推理链,在性能与效率之间实现平衡。
English: A*-Thought is an efficient tree search framework that compresses reasoning chains in Large Reasoning Models by identifying essential thoughts, balancing performance and efficiency through bidirectional estimation and A* search.

Authors:Fei Bai, Yingqian Min, Beichen Zhang, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Title: Towards Effective Code-Integrated Reasoning
Abstract:
In this paper, we investigate code-integrated reasoning, where models generate code when necessary and integrate feedback by executing it through a code interpreter. To acquire this capability, models must learn when and how to use external code tools effectively, which is supported by tool-augmented reinforcement learning (RL) through interactive learning. Despite its benefits, tool-augmented RL can still suffer from potential instability in the learning dynamics. In light of this challenge, we present a systematic approach to improving the training effectiveness and stability of tool-augmented RL for code-integrated reasoning. Specifically, we develop enhanced training strategies that balance exploration and stability, progressively building tool-use capabilities while improving reasoning performance. Through extensive experiments on five mainstream mathematical reasoning benchmarks, our model demonstrates significant performance improvements over multiple competitive baselines. Furthermore, we conduct an in-depth analysis of the mechanism and effect of code-integrated reasoning, revealing several key insights, such as the extension of the model's capability boundaries and the simultaneous improvement of reasoning efficiency through code integration. All data and code for reproducing this work are available at: https://github.com/RUCAIBox/CIR.
中文摘要:本文提出了一种系统性方法,通过平衡探索与稳定性的增强训练策略,有效提升了代码集成推理中工具增强强化学习的训练效果,在多个数学推理基准测试中实现了显著性能提升,并揭示了代码集成扩展模型能力边界的关键机制。
English Summary: This paper introduces a systematic approach to enhance the training stability and effectiveness of tool-augmented reinforcement learning for code-integrated reasoning, demonstrating significant performance gains across mathematical benchmarks through improved strategies that balance exploration and capability development.
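
The interaction loop underneath code-integrated reasoning is simple to sketch: run the model-generated snippet, capture its output or error, and feed the result back as context for the next reasoning step. The helper below uses a bare subprocess with a timeout as a stand-in; it is not a security sandbox, and the RUCAIBox/CIR training stack is substantially more elaborate.

```python
import os
import subprocess
import sys
import tempfile

def run_model_code(code: str, timeout_s: float = 5.0) -> str:
    """Execute model-generated code and return feedback for the model.

    The captured stdout/stderr is appended to the model's context so the
    next reasoning step can react to execution results.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, text=True,
                              timeout=timeout_s)
        out = proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        out = "ExecutionError: timed out"
    finally:
        os.unlink(path)
    return out

feedback = run_model_code("print(sum(i*i for i in range(10)))")
print(feedback)  # '285' is fed back into the reasoning loop
```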

Authors:Zhiwei Liu, Lingfei Qian, Qianqian Xie, Jimin Huang, Kailai Yang, Sophia Ananiadou
Title: MMAFFBen: A Multilingual and Multimodal Affective Analysis Benchmark for Evaluating LLMs and VLMs
Abstract:
Large language models and vision-language models (which we jointly call LMs) have transformed NLP and CV, demonstrating remarkable potential across various fields. However, their capabilities in affective analysis (i.e. sentiment analysis and emotion detection) remain underexplored. This gap is largely due to the absence of comprehensive evaluation benchmarks, and the inherent complexity of affective analysis tasks. In this paper, we introduce MMAFFBen, the first extensive open-source benchmark for multilingual multimodal affective analysis. MMAFFBen encompasses text, image, and video modalities across 35 languages, covering four key affective analysis tasks: sentiment polarity, sentiment intensity, emotion classification, and emotion intensity. Moreover, we construct the MMAFFIn dataset for fine-tuning LMs on affective analysis tasks, and further develop MMAFFLM-3b and MMAFFLM-7b based on it. We evaluate various representative LMs, including GPT-4o-mini, providing a systematic comparison of their affective understanding capabilities. This project is available at https://github.com/lzw108/MMAFFBen.
中文:本文提出了首个全面的多语言多模态情感分析基准MMAFFBen,并开发了专门模型来系统评估语言模型在不同模态和语言中的情感理解能力。
English: This paper introduces MMAFFBen, the first comprehensive multilingual multimodal benchmark for affective analysis, and develops specialized models to systematically evaluate language models' capabilities in sentiment and emotion tasks across diverse modalities and languages.

Authors:Gilles Quentin Hacheme, Girmaw Abebe Tadesse, Caleb Robinson, Akram Zaytar, Rahul Dodhia, Juan M. Lavista Ferres
Title: GeoVision Labeler: Zero-Shot Geospatial Classification with Vision and Language Models
Abstract:
Classifying geospatial imagery remains a major bottleneck for applications such as disaster response and land-use monitoring, particularly in regions where annotated data is scarce or unavailable. Existing tools (e.g., RS-CLIP) that claim zero-shot classification capabilities for satellite imagery nonetheless rely on task-specific pretraining and adaptation to reach competitive performance. We introduce GeoVision Labeler (GVL), a strictly zero-shot classification framework: a vision Large Language Model (vLLM) generates rich, human-readable image descriptions, which are then mapped to user-defined classes by a conventional Large Language Model (LLM). This modular, interpretable pipeline enables flexible image classification for a large range of use cases. We evaluated GVL across three benchmarks: SpaceNet v7, UC Merced, and RESISC45. It achieves up to 93.2% zero-shot accuracy on the binary Buildings vs. No Buildings task on SpaceNet v7. For complex multi-class classification tasks (UC Merced, RESISC45), we implemented recursive LLM-driven clustering to form meta-classes at successive depths, followed by hierarchical classification (first resolving coarse groups, then finer distinctions) to deliver competitive zero-shot performance. GVL is open-sourced at https://github.com/microsoft/geo-vision-labeler to catalyze adoption in real-world geospatial workflows.
Chinese: GeoVision标注器(GVL)是一种严格零样本分类框架,通过视觉大语言模型生成图像描述并由常规大语言模型将其映射至用户定义类别,无需任务特定训练即可实现高精度地理空间影像分类。
English: The GeoVision Labeler (GVL) is a strictly zero-shot classification framework that uses a vision Large Language Model to generate image descriptions and a conventional LLM to map them to user-defined classes, achieving high accuracy in geospatial imagery classification without task-specific training.

Authors:James R. Golden
Title: Large Language Models are Locally Linear Mappings
Abstract:
We demonstrate that the inference operations of several open-weight large language models (LLMs) can be mapped to an exactly equivalent linear system for an input sequence without modifying the model weights or altering output predictions. Extending techniques from image diffusion models that exhibit local or piecewise linearity, we strategically alter the gradient computation with respect to a given input sequence for a next-token prediction such that the Jacobian of the model nearly exactly reproduces the forward prediction with a linear system. We demonstrate this approach across models (Llama 3, Gemma 3, Qwen 3, Phi 4, Mistral Ministral and OLMo 2, up to Llama 3.3 70B Q4) and show through the singular value decomposition of the detached Jacobian that these LLMs operate in extremely low-dimensional subspaces where many of the largest singular vectors decode to concepts related to the most-likely output token. This approach also allows us to examine the operation of each successive layer (and its attention and MLP components) as nearly-exact linear systems and observe the emergence of semantic concepts. Despite their expressive power and global nonlinearity, modern LLMs can be interpreted through nearly-exact locally linear decompositions that provide insights into their internal representations and reveal interpretable semantic structures in the next-token prediction process.
中文: 研究表明,多种大型语言模型的推理过程无需修改权重即可精确映射为线性系统,通过奇异值分解发现这些模型在低维空间中运行,其中主要向量对应预测词汇相关的语义概念。
English: This study shows that the inference processes of various large language models can be accurately represented as linear systems without changing model weights, revealing through singular value decomposition that these models operate in low-dimensional spaces where dominant vectors correspond to semantic concepts related to predicted tokens.
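
The core identity is easy to verify on a small piecewise-linear network: a bias-free ReLU MLP satisfies f(x) = J(x) x exactly, so the Jacobian at an input reproduces the forward pass. The paper's contribution is arranging the gradient computation so that full transformer LLMs admit a nearly-exact version of the same treatment; the toy check below only demonstrates the underlying local-linearity fact.

```python
import torch

# A bias-free ReLU MLP is exactly locally linear: f(x) = J(x) @ x.
net = torch.nn.Sequential(
    torch.nn.Linear(8, 16, bias=False),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 4, bias=False),
)

x = torch.randn(8)
jac = torch.autograd.functional.jacobian(net, x)  # shape (4, 8)
assert torch.allclose(net(x), jac @ x, atol=1e-5)
print("forward pass reproduced by the local linear map:", (jac @ x).tolist())
```

An SVD of such a detached Jacobian is then what exposes the low-dimensional, concept-aligned subspaces the abstract describes.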

Authors:Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim
Title: Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games
Abstract:
Large Language Models (LLMs) have shown potential in simulating human behaviors and performing theory-of-mind (ToM) reasoning, a crucial skill for complex social interactions. In this study, we investigate the role of ToM reasoning in aligning agentic behaviors with human norms in negotiation tasks, using the ultimatum game as a controlled environment. We initialized LLM agents with different prosocial beliefs (including Greedy, Fair, and Selfless) and reasoning methods like chain-of-thought (CoT) and varying ToM levels, and examined their decision-making processes across diverse LLMs, including reasoning models like o3-mini and DeepSeek-R1 Distilled Qwen 32B. Results from 2,700 simulations indicated that ToM reasoning enhances behavior alignment, decision-making consistency, and negotiation outcomes. Consistent with previous findings, reasoning models exhibit limited capability compared to models with ToM reasoning, and different roles in the game benefit from different orders of ToM reasoning. Our findings contribute to the understanding of ToM's role in enhancing human-AI interaction and cooperative decision-making. The code used for our experiments can be found at https://github.com/Stealth-py/UltimatumToM.
中文摘要:本研究证明,在谈判任务中,心理理论推理能显著提高大型语言模型代理行为与人类规范的契合度,在不同亲社会信念设置下增强决策一致性和谈判结果。
English Summary: This study demonstrates that theory-of-mind (ToM) reasoning significantly improves the alignment of LLM agent behaviors with human norms in negotiation tasks, enhancing decision-making consistency and outcomes across various prosocial belief settings.

Authors:Jiwan Chung, Janghan Yoon, Junhyeong Park, Sangeyl Lee, Joowon Yang, Sooyeon Park, Youngjae Yu
Title: Are Any-to-Any Models More Consistent Across Modality Transfers Than Specialists?
Abstract:
Any-to-any generative models aim to enable seamless interpretation and generation across multiple modalities within a unified framework, yet their ability to preserve relationships across modalities remains uncertain. Do unified models truly achieve cross-modal coherence, or is this coherence merely perceived? To explore this, we introduce ACON, a dataset of 1,000 images (500 newly contributed) paired with captions, editing instructions, and Q&A pairs to evaluate cross-modal transfers rigorously. Using three consistency criteria (cyclic consistency, forward equivariance, and conjugated equivariance), our experiments reveal that any-to-any models do not consistently demonstrate greater cross-modal consistency than specialized models in pointwise evaluations such as cyclic consistency. However, equivariance evaluations uncover weak but observable consistency through structured analyses of the intermediate latent space enabled by multiple editing operations. We release our code and data at https://github.com/JiwanChung/ACON.
Chinese: 本研究引入ACON数据集评估多模态生成模型的跨模态一致性,发现统一模型在循环一致性方面未优于专用模型,但通过潜在空间的等变性分析可观测到微弱的一致性表现。
English: This study introduces the ACON dataset to evaluate cross-modal consistency in any-to-any generative models, finding they do not outperform specialized models in cyclic consistency but show weak consistency through equivariance analysis of latent spaces.

Authors:Shilin Xu, Yanwei Li, Rui Yang, Tao Zhang, Yueyi Sun, Wei Chow, Linfeng Li, Hang Song, Qi Xu, Yunhai Tong, Xiangtai Li, Hao Fei
Title: Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models
Abstract:
Recent works on large language models (LLMs) have successfully demonstrated the emergence of reasoning capabilities via reinforcement learning (RL). Although recent efforts leverage group relative policy optimization (GRPO) for MLLMs post-training, they typically explore only one specific aspect, such as grounding tasks, math problems, or chart analysis. No existing work leverages multi-source MLLM tasks for stable reinforcement learning. In this work, we present a unified perspective to solve this problem. We present Mixed-R1, a unified yet straightforward framework that contains a mixed reward function design (Mixed-Reward) and a mixed post-training dataset (Mixed-45K). We first design a data engine to select high-quality examples to build the Mixed-45K post-training dataset. Then, we present a Mixed-Reward design, which contains various reward functions for various MLLM tasks. In particular, it has four different reward functions: matching reward for binary answer or multiple-choice problems, chart reward for chart-aware datasets, IoU reward for grounding problems, and open-ended reward for long-form text responses such as caption datasets. To handle the various long-form text content, we propose a new open-ended reward named Bidirectional Max-Average Similarity (BMAS) by leveraging tokenizer embedding matching between the generated response and the ground truth. Extensive experiments show the effectiveness of our proposed method on various MLLMs, including Qwen2.5-VL and Intern-VL on various sizes. Our dataset and model are available at https://github.com/xushilin1/mixed-r1.
Chinese: 本文提出Mixed-R1框架,通过混合奖励函数设计和混合后训练数据集,解决了多模态大语言模型在多源任务中稳定强化学习的难题,并在Qwen2.5-VL和Intern-VL等模型上验证了其有效性。
English: This paper introduces Mixed-R1, a unified framework that combines a mixed reward function design and a mixed post-training dataset to enable stable reinforcement learning across diverse multimodal large language model tasks, demonstrating effectiveness on models like Qwen2.5-VL and Intern-VL.
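
Of the four rewards, the open-ended BMAS reward is the most self-contained to sketch: cosine-match every generated token embedding to its best reference token and vice versa, then average the two directions. The normalization and random embeddings below are assumptions; the paper computes the matching over the model's tokenizer embeddings.

```python
import numpy as np

def bmas_reward(gen_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Bidirectional Max-Average Similarity for open-ended responses.

    For every generated token embedding take its best match in the
    reference (and vice versa), then average the two directions: a soft,
    order-insensitive overlap in embedding space.
    """
    g = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = g @ r.T  # cosine similarities, shape (|gen|, |ref|)
    return float(0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean()))

rng = np.random.default_rng(0)
print(bmas_reward(rng.normal(size=(12, 64)), rng.normal(size=(9, 64))))
```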

Authors:Chiwei Zhu, Benfeng Xu, An Yang, Junyang Lin, Quan Wang, Chang Zhou, Zhendong Mao
Title: Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability
Abstract:
Training language models with rationales augmentation has been shown to be beneficial in many existing works. In this paper, we identify that such a prevailing view does not hold consistently. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance as well as a novel perspective of model reliability. The results lead to several key findings that add new insights to existing understandings: 1) Rationales can, at times, deteriorate model performance; 2) Rationales can, at times, improve model reliability, even outperforming their untrained counterparts; 3) A linear correspondence exists between the performance and reliability improvements, while both are driven by the intrinsic difficulty of the task. These findings provide informative regulations on the broad utilization of rationales and raise critical implications on the procedure of explicitly aligning language models with implicit human thoughts. Codes can be found at https://github.com/Ignoramus0817/rationales.
Chinese: 本研究挑战了理性增强必然提升语言模型性能的普遍观点,揭示其有时反而会损害性能却能提高可靠性,这两种效应均受任务难度驱动,为模型与人类思维的隐性对齐提供了新见解。
English: This study challenges the prevailing view that rationales always enhance language model performance, revealing they can sometimes impair it while improving reliability, with both effects driven by task difficulty and offering new insights for aligning models with human reasoning.

Authors:Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li
Title: HardTests: Synthesizing High-Quality Test Cases for LLM Coding
Abstract:
Verifiers play a crucial role in large language model (LLM) reasoning, needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to get for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully crafted, human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate a comprehensive competitive programming dataset HARDTESTS with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves to be more effective for model training, measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.

Authors:Vishal Dey, Xiao Hu, Xia Ning
Title: Large Language Models for Controllable Multi-property Multi-objective Molecule Optimization
Abstract:
In real-world drug design, molecule optimization requires selectively improving multiple molecular properties up to pharmaceutically relevant levels, while maintaining others that already meet such criteria. However, existing computational approaches and instruction-tuned LLMs fail to capture such nuanced property-specific objectives, limiting their practical applicability. To address this, we introduce C-MuMOInstruct, the first instruction-tuning dataset focused on multi-property optimization with explicit, property-specific objectives. Leveraging C-MuMOInstruct, we develop GeLLMO-Cs, a series of instruction-tuned LLMs that can perform targeted property-specific optimization. Our experiments across 5 in-distribution and 5 out-of-distribution tasks show that GeLLMO-Cs consistently outperform strong baselines, achieving up to a 126% higher success rate. Notably, GeLLMO-Cs exhibit impressive 0-shot generalization to novel optimization tasks and unseen instructions. This offers a step toward a foundational LLM to support realistic, diverse optimizations with property-specific objectives. C-MuMOInstruct and code are accessible through https://github.com/ninglab/GeLLMO-C.
中文摘要:本研究提出了首个针对多属性分子优化的指令调优数据集C-MuMOInstruct,并开发了GeLLMO-Cs模型系列,该模型在保持特定属性的同时显著提升其他分子性能,成功率达基线126%以上,且对新任务展现出卓越的零样本泛化能力。
English Summary: This study introduces C-MuMOInstruct, the first instruction-tuning dataset for multi-property molecular optimization, and develops GeLLMO-Cs models that significantly outperform baselines with up to 126% higher success rates while demonstrating strong generalization to novel tasks.

Authors:Feiteng Fang, Ting-En Lin, Yuchuan Wu, Xiong Liu, Xiang Huang, Dingwei Chen, Jing Ye, Haonan Zhang, Liang Zhu, Hamid Alinejad-Rokny, Min Yang, Fei Huang, Yongbin Li
Title: ChARM: Character-based Act-adaptive Reward Modeling for Advanced Role-Playing Language Agents
Abstract:
Role-Playing Language Agents (RPLAs) aim to simulate characters for realistic and engaging human-computer interactions. However, traditional reward models often struggle with scalability and adapting to subjective conversational preferences. We propose ChARM, a Character-based Act-adaptive Reward Model, addressing these challenges through two innovations: (1) an act-adaptive margin that significantly enhances learning efficiency and generalizability, and (2) a self-evolution mechanism leveraging large-scale unlabeled data to improve training coverage. Additionally, we introduce RoleplayPref, the first large-scale preference dataset specifically for RPLAs, featuring 1,108 characters, 13 subcategories, and 16,888 bilingual dialogues, alongside RoleplayEval, a dedicated evaluation benchmark. Experimental results show a 13% improvement over the conventional Bradley-Terry model in preference rankings. Furthermore, applying ChARM-generated rewards to preference learning techniques (e.g., direct preference optimization) achieves state-of-the-art results on CharacterEval and RoleplayEval. Code and dataset are available at https://github.com/calubkk/ChARM.
Chinese: 提出的ChARM模型通过行为自适应边界和自我进化机制改进了角色扮演语言代理,在偏好排名上提升了13%,并在评估基准中取得了最优结果。
English: The proposed ChARM model enhances role-playing language agents with an act-adaptive margin and self-evolution mechanism, achieving a 13% improvement in preference rankings and state-of-the-art results on evaluation benchmarks.

Authors:David Ma, Huaqing Yuan, Xingjian Wang, Qianbo Zang, Tianci Liu, Xinyang He, Yanbin Wei, Jiawei Guo, Ni Jiahui, Zhenzhu Yang, Meng Cao, Shanghaoran Quan, Yizhi Li, Wangchunshu Zhou, Jiaheng Liu, Wenhao Huang, Ge Zhang, Shiwen Ni, Xiaojie Jin
Title: ScaleLong: A Multi-Timescale Benchmark for Long Video Understanding
Abstract:
Although long-video understanding demands that models capture hierarchical temporal information -- from clip (seconds) and shot (tens of seconds) to event (minutes) and story (hours) -- existing benchmarks either neglect this multi-scale design or scatter scale-specific questions across different videos, preventing direct comparison of model performance across timescales on the same content. To address this, we introduce ScaleLong, the first benchmark to disentangle these factors by embedding questions targeting four hierarchical timescales -- clip (seconds), shot (tens of seconds), event (minutes), and story (hours) -- all within the same video content. This within-content multi-timescale questioning design enables direct comparison of model performance across timescales on identical videos. ScaleLong features 269 long videos (avg. 86 min) from 5 main categories and 36 sub-categories, with 4-8 carefully designed questions, including at least one question for each timescale. Evaluating 23 MLLMs reveals a U-shaped performance curve, with higher accuracy at the shortest and longest timescales and a dip at intermediate levels. Furthermore, ablation studies show that increased visual token capacity consistently enhances reasoning across all timescales. ScaleLong offers a fine-grained, multi-timescale benchmark for advancing MLLM capabilities in long-video understanding. The code and dataset are available at https://github.com/multimodal-art-projection/ScaleLong.
中文:ScaleLong基准通过在相同视频内容中嵌入多时间尺度问题,实现了模型在分层时间级别上的直接性能比较,揭示了U形准确率曲线并证明增加视觉标记容量可提升推理能力。
English: The ScaleLong benchmark introduces multi-timescale questions within the same video content to enable direct comparison of model performance across hierarchical temporal levels, revealing a U-shaped accuracy curve and demonstrating that increased visual token capacity improves reasoning.

Authors:Mengkang Hu, Yuhang Zhou, Wendong Fan, Yuzhou Nie, Bowei Xia, Tao Sun, Ziyu Ye, Zhaoxuan Jin, Yingru Li, Qiguang Chen, Zeyu Zhang, Yifeng Wang, Qianshuo Ye, Bernard Ghanem, Ping Luo, Guohao Li
Title: OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
Abstract:
Large Language Model (LLM)-based multi-agent systems show promise for automating real-world tasks but struggle to transfer across domains due to their domain-specific nature. Current approaches face two critical shortcomings: they require complete architectural redesign and full retraining of all components when applied to new domains. We introduce Workforce, a hierarchical multi-agent framework that decouples strategic planning from specialized execution through a modular architecture comprising: (i) a domain-agnostic Planner for task decomposition, (ii) a Coordinator for subtask management, and (iii) specialized Workers with domain-specific tool-calling capabilities. This decoupling enables cross-domain transferability during both inference and training phases: During inference, Workforce seamlessly adapts to new domains by adding or modifying worker agents; For training, we introduce Optimized Workforce Learning (OWL), which improves generalization across domains by optimizing a domain-agnostic planner with reinforcement learning from real-world feedback. To validate our approach, we evaluate Workforce on the GAIA benchmark, covering various realistic, multi-domain agentic tasks. Experimental results demonstrate Workforce achieves open-source state-of-the-art performance (69.70%), outperforming commercial systems like OpenAI's Deep Research by 2.34%. More notably, our OWL-trained 32B model achieves 52.73% accuracy (+16.37%) and demonstrates performance comparable to GPT-4o on challenging tasks. To summarize, by enabling scalable generalization and modular domain transfer, our work establishes a foundation for the next generation of general-purpose AI assistants.
中文:Workforce提出了一种分层多智能体框架,通过将规划与执行解耦,借助模块化智能体和优化训练实现跨领域适应能力,在GAIA等基准测试中取得了领先性能。
English: Workforce introduces a hierarchical multi-agent framework that decouples planning from execution, enabling cross-domain adaptability through modular agents and optimized training, achieving state-of-the-art performance on benchmarks like GAIA.

Authors:Jianyang Gu, Samuel Stevens, Elizabeth G Campolongo, Matthew J Thompson, Net Zhang, Jiaman Wu, Andrei Kopanev, Zheda Mai, Alexander E. White, James Balhoff, Wasila Dahdul, Daniel Rubenstein, Hilmar Lapp, Tanya Berger-Wolf, Wei-Lun Chao, Yu Su
Title: BioCLIP 2: Emergent Properties from Scaling Hierarchical Contrastive Learning
Abstract:
Foundation models trained at scale exhibit remarkable emergent behaviors, learning new capabilities beyond their initial training objectives. We find such emergent behaviors in biological vision models via large-scale contrastive vision-language training. To achieve this, we first curate TreeOfLife-200M, comprising 214 million images of living organisms, the largest and most diverse biological organism image dataset to date. We then train BioCLIP 2 on TreeOfLife-200M to distinguish different species. Despite the narrow training objective, BioCLIP 2 yields extraordinary accuracy when applied to various biological visual tasks such as habitat classification and trait prediction. We identify emergent properties in the learned embedding space of BioCLIP 2. At the inter-species level, the embedding distribution of different species aligns closely with functional and ecological meanings (e.g., beak sizes and habitats). At the intra-species level, instead of being diminished, the intra-species variations (e.g., life stages and sexes) are preserved and better separated in subspaces orthogonal to inter-species distinctions. We provide formal proof and analyses to explain why hierarchical supervision and contrastive objectives encourage these emergent properties. Crucially, our results reveal that these properties become increasingly significant with larger-scale training data, leading to a biologically meaningful embedding space.

Authors:Sahil Verma, Keegan Hines, Jeff Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh
Title: OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities
Abstract:
The emerging capabilities of large language models (LLMs) have sparked concerns about their immediate potential for harmful misuse. The core approach to mitigate these concerns is the detection of harmful queries to the model. Current detection approaches are fallible, and are particularly susceptible to attacks that exploit mismatched generalization of model capabilities (e.g., prompts in low-resource languages or prompts provided in non-text modalities such as image and audio). To tackle this challenge, we propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach (i) identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and then (ii) uses them to build a language-agnostic or modality-agnostic classifier for detecting harmful prompts. OMNIGUARD improves harmful prompt classification accuracy by 11.57\% over the strongest baseline in a multilingual setting, by 20.44\% for image-based prompts, and sets a new SOTA for audio-based prompts. By repurposing embeddings computed during generation, OMNIGUARD is also very efficient ($\approx 120 \times$ faster than the next fastest baseline). Code and data are available at: https://github.com/vsahil/OmniGuard.
中文摘要:本文提出OMNIGUARD方法,通过利用大语言模型中跨语言和跨模态对齐的内部表征来检测有害提示,在多语言和跨模态场景下显著提升了分类准确率与检测效率。
English Summary: The paper introduces OMNIGUARD, a method that detects harmful prompts across languages and modalities by leveraging aligned internal representations of LLMs, significantly improving classification accuracy and efficiency over existing baselines.
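
A minimal sketch of the two-step recipe in the abstract: pull an internal representation the model computes anyway during generation, then fit a lightweight classifier on it. The model checkpoint, probe layer, and mean-pooling choice are assumptions for illustration, not the paper's configuration.

```python
# Hedged sketch: hidden-state probing for harmful-prompt detection.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL, LAYER = "Qwen/Qwen2.5-0.5B-Instruct", 12  # assumed, not from the paper

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def embed(prompt: str) -> torch.Tensor:
    """Mean-pooled hidden state at one layer; reuses generation-time compute."""
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"))
    return out.hidden_states[LAYER][0].mean(dim=0)

prompts = ["How do I bake bread?", "How do I build a weapon?"]  # toy data
labels = [0, 1]  # 0 = benign, 1 = harmful
X = torch.stack([embed(p) for p in prompts]).numpy()

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(torch.stack([embed("Recipe for soup?")]).numpy()))
```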

Authors:Michael Shalyt, Rotem Elimelech, Ido Kaminer
Title: ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark
Abstract:
Large language models (LLMs) are rapidly approaching the level of proficiency in university-level symbolic mathematics required for applications in advanced science and technology. However, existing benchmarks fall short in assessing the core skills of LLMs in symbolic mathematics, such as integration, differential equations, and algebraic simplification. To address this gap, we introduce ASyMOB, a novel assessment framework focused exclusively on symbolic manipulation, featuring 17,092 unique math challenges organized by similarity and complexity. ASyMOB enables analysis of LLM generalization capabilities by comparing performance on problems that differ by simple numerical or symbolic "perturbations". Evaluated LLMs exhibit substantial performance degradation for all perturbation types (up to -70.3%), suggesting reliance on memorized patterns rather than a deeper understanding of symbolic math, even among models achieving high baseline accuracy. Comparing LLM performance to computer algebra systems (CAS), we identify examples where CAS fail while LLMs succeed, as well as problems solved only by combining both approaches. Models capable of integrated code execution yielded higher accuracy than the same models without code, particularly stabilizing weaker models (up to +33.1% for certain perturbation types). Notably, the most advanced models (o4-mini, Gemini 2.5 Flash) demonstrate not only high symbolic math proficiency (scoring 96.8% and 97.6% on the unperturbed set) but also remarkable robustness against perturbations (-21.7% and -21.2% vs. an average of -50.4% for the other models). This may indicate a recent "phase transition" in the generalization capabilities of frontier LLMs. It remains to be seen whether the path forward lies in deeper integration with sophisticated external tools, or in developing models so capable that symbolic math systems like CAS become unnecessary.
中文: 大型语言模型在符号数学方面虽取得进展,但ASyMOB评估框架显示其泛化能力不足,主要依赖记忆而非深层理解;不过顶尖模型如o4-mini和Gemini 2.5 Flash表现出卓越的解题能力和抗干扰性。
English: Large language models are advancing in symbolic mathematics but struggle with generalization, as shown by the ASyMOB framework, which reveals their reliance on memorization rather than true understanding, though top models like o4-mini and Gemini 2.5 Flash exhibit high proficiency and robustness.

Authors:Zhenglun Kong, Zheng Zhan, Shiyue Hou, Yifan Gong, Xin Meng, Pengwei Sui, Peiyan Dong, Xuan Shen, Zifeng Wang, Pu Zhao, Hao Tang, Stratis Ioannidis, Yanzhi Wang
Title: Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation
Abstract:
Large language models (LLMs) have shown remarkable promise but remain challenging to continually improve through traditional finetuning, particularly when integrating capabilities from other specialized LLMs. Popular methods like ensemble and weight merging require substantial memory and struggle to adapt to changing data environments. Recent efforts have transferred knowledge from multiple LLMs into a single target model; however, they suffer from interference and degraded performance among tasks, largely due to limited flexibility in candidate selection and training pipelines. To address these issues, we propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensembles and the inflexibility of weight merging. Specifically, we design an adaptive selection network that identifies the most relevant source LLMs based on their scores, thereby reducing knowledge interference. We further propose a dynamic weighted fusion strategy that accounts for the inherent strengths of candidate LLMs, along with a feedback-driven loss function that prevents the selector from converging on a single subset of sources. Experimental results demonstrate that our method can enable a more stable and scalable knowledge aggregation process while reducing knowledge interference by up to 50% compared to existing approaches. Code is available at https://github.com/ZLKong/LLM_Integration
中文摘要:我们提出的框架自适应地选择和整合来自不同大语言模型的知识,通过自适应选择网络和动态融合策略构建更强大的单一模型,将知识干扰降低高达50%并减少内存开销。
English Summary: Our proposed framework adaptively selects and aggregates knowledge from diverse LLMs to build a stronger single model, reducing memory overhead and knowledge interference by up to 50% through an adaptive selection network and dynamic fusion strategy.

Authors:Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi
Title: Measuring Sycophancy of Language Models in Multi-turn Dialogues
Abstract:
Large Language Models (LLMs) are expected to provide helpful and harmless responses, yet they often exhibit sycophancy--conforming to user beliefs regardless of factual accuracy or ethical soundness. Prior research on sycophancy has primarily focused on single-turn factual correctness, overlooking the dynamics of real-world interactions. In this work, we introduce SYCON Bench, a novel benchmark for evaluating sycophantic behavior in multi-turn, free-form conversational settings. Our benchmark measures how quickly a model conforms to the user (Turn of Flip) and how frequently it shifts its stance under sustained user pressure (Number of Flip). Applying SYCON Bench to 17 LLMs across three real-world scenarios, we find that sycophancy remains a prevalent failure mode. Our analysis shows that alignment tuning amplifies sycophantic behavior, whereas model scaling and reasoning optimization strengthen the model's ability to resist undesirable user views. Reasoning models generally outperform instruction-tuned models but often fail when they over-index on logical exposition instead of directly addressing the user's underlying beliefs. Finally, we evaluate four additional prompting strategies and demonstrate that adopting a third-person perspective reduces sycophancy by up to 63.8% in the debate scenario. We release our code and data at https://github.com/JiseungHong/SYCON-Bench.
中文摘要:大型语言模型常表现出迎合用户观点的谄媚行为,SYCON Bench基准测试表明该现象普遍存在,同时证明推理优化和第三人称提示能显著降低谄媚倾向。
English Summary: Large Language Models frequently exhibit sycophancy by conforming to user beliefs, and the SYCON Bench benchmark reveals this behavior persists across models while showing that reasoning optimization and third-person prompting can significantly reduce it.
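
The two metrics are simple to state precisely. Below is a small sketch of how Turn of Flip and Number of Flip could be computed from per-turn stance labels, assuming a separate judge has already mapped each model response to a stance string (that labeling step is the hard part and is not shown).

```python
# Sketch of the two SYCON Bench metrics as the abstract defines them.
def turn_of_flip(stances: list[str], user_view: str) -> int | None:
    """1-indexed turn at which the model first adopts the user's view."""
    for turn, stance in enumerate(stances, start=1):
        if stance == user_view:
            return turn
    return None  # never flipped


def number_of_flips(stances: list[str]) -> int:
    """How many times the stance changes across consecutive turns."""
    return sum(a != b for a, b in zip(stances, stances[1:]))


stances = ["disagree", "disagree", "agree", "disagree", "agree"]
print(turn_of_flip(stances, "agree"))   # 3
print(number_of_flips(stances))         # 3
```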

Authors:Trenton Chang, Tobias Schnabel, Adith Swaminathan, Jenna Wiens
Title: A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs
Abstract:
Despite advances in large language models (LLMs) on reasoning and instruction-following benchmarks, it remains unclear whether they can reliably produce outputs aligned with a broad variety of user goals, a concept we refer to as steerability. The abundance of methods proposed to modify LLM behavior makes it unclear whether current LLMs are already steerable, or require further intervention. In particular, LLMs may exhibit (i) poor coverage, where rare user goals are underrepresented; (ii) miscalibration, where models overshoot requests; and (iii) side effects, where changes to one dimension of text inadvertently affect others. To systematically evaluate these failures, we introduce a framework based on a multi-dimensional goal space that models user goals and LLM outputs as vectors with dimensions corresponding to text attributes (e.g., reading difficulty). Applied to a text-rewriting task, we find that current LLMs struggle with steerability, as side effects are persistent. Interventions to improve steerability, such as prompt engineering, best-of-$N$ sampling, and reinforcement learning fine-tuning, have varying effectiveness, yet side effects remain problematic. Our findings suggest that even strong LLMs struggle with steerability, and existing alignment strategies may be insufficient. We open-source our steerability evaluation framework at https://github.com/MLD3/steerability.
中文: 当前大型语言模型在可控性方面存在不足,表现出覆盖率低、校准偏差和副作用持续等问题,现有对齐策略虽效果各异但仍显不足。
English: Current large language models struggle with steerability, exhibiting issues like poor coverage, miscalibration, and persistent side effects, and existing alignment strategies remain insufficient despite varying effectiveness.
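
A toy numeric example of the multi-dimensional goal-space view: treating texts as attribute vectors makes miscalibration (overshoot on requested dimensions) and side effects (drift on untouched dimensions) directly measurable. The attribute names and the sum-of-absolute-differences scoring below are illustrative assumptions, not the framework's exact definitions.

```python
# Toy sketch: goals and outputs as points in a text-attribute space.
import numpy as np

attrs = ["reading_difficulty", "formality", "length"]
source = np.array([0.5, 0.5, 0.5])   # attribute profile of the input text
goal   = np.array([0.8, 0.5, 0.5])   # user asks: raise reading difficulty
output = np.array([0.95, 0.7, 0.5])  # what the model actually produced

requested = goal != source
miscalibration = np.abs(output - goal)[requested].sum()   # overshoot: 0.15
side_effects = np.abs(output - source)[~requested].sum()  # unasked drift: 0.2
print(miscalibration, side_effects)
```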

Authors:Lin Mu, Xiaoyu Wang, Li Ni, Yang Li, Zhize Wu, Peiquan Jin, Yiwen Zhang
Title: DenseLoRA: Dense Low-Rank Adaptation of Large Language Models
Abstract:
Low-rank adaptation (LoRA) has been developed as an efficient approach for adapting large language models (LLMs) by fine-tuning two low-rank matrices, thereby reducing the number of trainable parameters. However, prior research indicates that many of the weights in these matrices are redundant, leading to inefficiencies in parameter utilization. To address this limitation, we introduce Dense Low-Rank Adaptation (DenseLoRA), a novel approach that enhances parameter efficiency while achieving superior performance compared to LoRA. DenseLoRA builds upon the concept of representation fine-tuning, incorporating a single Encoder-Decoder to refine and compress hidden representations across all adaptation layers before applying adaptation. Instead of relying on two redundant low-rank matrices as in LoRA, DenseLoRA adapts LLMs through a dense low-rank matrix, improving parameter utilization and adaptation efficiency. We evaluate DenseLoRA on various benchmarks, showing that it achieves 83.8% accuracy with only 0.01% of trainable parameters, compared to LoRA's 80.8% accuracy with 0.70% of trainable parameters on LLaMA3-8B. Additionally, we conduct extensive experiments to systematically assess the impact of DenseLoRA's components on overall model performance. Code is available at https://github.com/mulin-ahu/DenseLoRA.
中文: DenseLoRA通过使用稠密低秩矩阵替代冗余的低秩矩阵来提升大语言模型适配中的参数效率,在LLaMA3-8B上仅用0.01%可训练参数就实现了83.8%的准确率。
English: DenseLoRA enhances parameter efficiency in adapting large language models by using a dense low-rank matrix instead of redundant low-rank matrices, achieving 83.8% accuracy with only 0.01% trainable parameters on LLaMA3-8B.
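
A hedged PyTorch sketch of the mechanism the abstract describes: one encoder/decoder shared across adaptation layers refines and compresses hidden representations, and each layer's update flows through a single dense low-rank matrix rather than LoRA's A/B pair. The dimensions, the tanh nonlinearity, and the zero initialization are illustrative assumptions.

```python
# Illustrative sketch of the DenseLoRA idea; toy dimensions throughout.
import torch
import torch.nn as nn

d_model, r = 64, 8  # hidden size and low-rank dim (toy values)

shared_enc = nn.Linear(d_model, r)   # shared across all adapted layers
shared_dec = nn.Linear(r, d_model)

class DenseLoRALayer(nn.Module):
    def __init__(self, base: nn.Linear):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # frozen pretrained weight
        self.M = nn.Parameter(torch.zeros(r, r))  # dense low-rank update

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.tanh(shared_enc(x))         # refine + compress
        return self.base(x) + shared_dec(z @ self.M)

layer = DenseLoRALayer(nn.Linear(d_model, d_model))
print(layer(torch.randn(2, d_model)).shape)  # torch.Size([2, 64])
```

The trainable parameters per layer are just the r-by-r matrix M, which is where the abstract's very small trainable-parameter fraction comes from; the encoder/decoder cost is amortized across all layers.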

Authors:Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang
Title: DLP: Dynamic Layerwise Pruning in Large Language Models
Abstract:
Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to severe performance degradation at high sparsity levels. Recognizing the varying contributions of different layers in LLMs, recent studies have shifted their focus toward non-uniform layerwise pruning. However, these approaches often rely on pre-defined values, which can result in suboptimal performance. To overcome these limitations, we propose a novel method called Dynamic Layerwise Pruning (DLP). This approach adaptively determines the relative importance of each layer by integrating model weights with input activation information, assigning pruning rates accordingly. Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). We release the code at https://github.com/ironartisan/DLP to facilitate future research.
中文: 动态分层剪枝(DLP)方法通过整合模型权重和输入激活信息自适应地确定各层重要性,在高稀疏度下有效保持大语言模型性能,并能与现有压缩技术无缝兼容。
English: The proposed Dynamic Layerwise Pruning (DLP) method adaptively determines layer importance by integrating model weights and input activations, effectively preserving LLM performance at high sparsity levels while demonstrating compatibility with existing compression techniques.
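
The key move, assigning per-layer pruning rates from importance rather than using a uniform rate, can be sketched as below. The Wanda-style importance score and the linear rate-allocation rule are stand-ins for illustration; the paper's actual formulas differ.

```python
# Illustrative sketch of dynamic layerwise pruning-rate allocation.
import numpy as np

def layer_scores(weights, act_norms):
    """Wanda-style importance: |W| scaled by per-input activation norms."""
    return np.array([np.abs(W * a).mean() for W, a in zip(weights, act_norms)])

def allocate_rates(scores, target=0.7, spread=0.1):
    """Map relative importance to per-layer sparsity near the global target."""
    rel = (scores - scores.mean()) / (scores.std() + 1e-8)
    rates = target - spread * rel      # important layer -> lower sparsity
    return np.clip(rates, 0.0, 0.99)   # mean stays ~target unless clipped

rng = np.random.default_rng(0)
weights = [rng.normal(size=(16, 16)) for _ in range(4)]
act_norms = [rng.uniform(0.5, 2.0, size=16) for _ in range(4)]
rates = allocate_rates(layer_scores(weights, act_norms))
print(rates, rates.mean())  # per-layer sparsity, mean close to 0.7
```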

Authors:Yuan Li, Qi Luo, Xiaonan Li, Bufan Li, Qinyuan Cheng, Bo Wang, Yining Zheng, Yuxin Wang, Zhangyue Yin, Xipeng Qiu
Title: R3-RAG: Learning Step-by-Step Reasoning and Retrieval for LLMs via Reinforcement Learning
Abstract:
Retrieval-Augmented Generation (RAG) integrates external knowledge with Large Language Models (LLMs) to enhance factual correctness and mitigate hallucination. However, dense retrievers often become the bottleneck of RAG systems due to their limited parameters compared to LLMs and their inability to perform step-by-step reasoning. While prompt-based iterative RAG attempts to address these limitations, it is constrained by human-designed workflows. To address these limitations, we propose $\textbf{R3-RAG}$, which uses $\textbf{R}$einforcement learning to make the LLM learn how to $\textbf{R}$eason and $\textbf{R}$etrieve step by step, thus retrieving comprehensive external knowledge and arriving at correct answers. R3-RAG is divided into two stages. We first use a cold-start stage to teach the model to iteratively interleave reasoning and retrieval; we then use reinforcement learning to further harness its ability to explore the external retrieval environment. Specifically, we propose two rewards for R3-RAG: 1) answer correctness as an outcome reward, which judges whether the trajectory leads to a correct answer; and 2) relevance-based document verification as a process reward, which encourages the model to retrieve documents relevant to the user question. Together, these rewards let the model learn to iteratively reason and retrieve relevant documents until it reaches the correct answer. Experimental results show that R3-RAG significantly outperforms baselines and transfers well to different retrievers. We release R3-RAG at https://github.com/Yuan-Li-FNLP/R3-RAG.
中文摘要:R3-RAG通过强化学习框架使大语言模型自主掌握交替推理与检索的迭代策略,在提升答案准确性和系统性能方面显著优于现有基线方法。
English Summary: R3-RAG introduces a reinforcement learning framework that enables Large Language Models to autonomously learn iterative reasoning and retrieval strategies, significantly improving answer accuracy and outperforming existing methods.
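
A small sketch of the two reward signals named in the abstract, with a token-overlap heuristic standing in for the paper's relevance-based document verification; the combination weight alpha is likewise an assumption.

```python
# Hedged sketch of R3-RAG's outcome and process rewards.
def outcome_reward(predicted: str, gold: str) -> float:
    """1.0 if the trajectory ends in a correct answer, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def process_reward(question: str, retrieved_docs: list[str]) -> float:
    """Fraction of retrieved docs judged relevant (toy token-overlap judge)."""
    q_tokens = set(question.lower().split())
    relevant = sum(len(q_tokens & set(d.lower().split())) >= 2
                   for d in retrieved_docs)
    return relevant / max(len(retrieved_docs), 1)

def trajectory_reward(question, docs, predicted, gold, alpha=0.5):
    return outcome_reward(predicted, gold) + alpha * process_reward(question, docs)

print(trajectory_reward("who wrote hamlet", ["shakespeare wrote hamlet"],
                        "Shakespeare", "Shakespeare"))  # 1.5
```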

Authors:Chenyu Yang, Shiqian Su, Shi Liu, Xuan Dong, Yue Yu, Weijie Su, Xuehui Wang, Zhaoyang Liu, Jinguo Zhu, Hao Li, Wenhai Wang, Yu Qiao, Xizhou Zhu, Jifeng Dai
Title: ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Abstract:
The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.
中文:ZeroGUI是一种无需人工干预的在线学习框架,通过基于视觉语言模型的任务生成与奖励评估,自动训练图形界面代理,显著提升了在动态环境中的性能表现。
English: ZeroGUI is an online learning framework that automates GUI agent training without human intervention by using VLM-based task generation and reward estimation, significantly enhancing performance in dynamic environments.

Authors:Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Title: Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models
Abstract:
Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the \textbf{Premise Critique Ability} for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the \textbf{Premise Critique Bench (PCBench)}, designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at https://github.com/MLGroupJLU/Premise_Critique.
Chinese: 大型语言模型常无法批判性评估错误前提,因此开发PCBench以评估和提升其前提批判能力,揭示关键弱点并强调增强输入有效性评估的必要性。
English: Large language models often fail to critically evaluate flawed premises, prompting the development of PCBench to assess and improve their premise critique ability, revealing key vulnerabilities and the need for enhanced input validity evaluation.

Authors:Zixiang Xu, Yanbo Wang, Yue Huang, Jiayi Ye, Haomin Zhuang, Zirui Song, Lang Gao, Chenxi Wang, Zhaorun Chen, Yujun Zhou, Sixian Li, Wang Pan, Yue Zhao, Jieyu Zhao, Xiangliang Zhang, Xiuying Chen
Title: SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models
Abstract:
Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability - the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: https://huggingface.co/datasets/MBZUAI/SocialMaze
中文摘要:SocialMaze基准测试通过涵盖深度推理、动态交互和信息不确定性的六项任务,系统评估大语言模型的社交推理能力,揭示了模型在处理动态信息和不确定性时的表现差异,并证明针对性微调可显著提升复杂社交场景中的表现。
English Summary: The SocialMaze benchmark is introduced to systematically evaluate large language models' social reasoning capabilities through six tasks addressing deep reasoning, dynamic interaction, and information uncertainty, revealing performance variations and demonstrating that targeted fine-tuning enhances performance in complex social scenarios.

Authors:Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo
Title: ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions
Abstract:
Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task-execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.
中文摘要:大型语言模型在工具使用方面表现出色,但在长期交互中表现显著不足,新推出的ToolHaystack基准测试揭示了现有模型在持续对话中的鲁棒性缺陷,这是以往评测未能发现的。
English Summary: Large language models show strong tool-use capabilities but struggle significantly in long-term interactions, as revealed by the new ToolHaystack benchmark that exposes critical gaps in their robustness not captured by previous evaluations.

Authors:Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou, Feng Qin, Tianshi Zheng, Xi Peng, Xin Yao, Huiwen Yang, Leijie Wu, Yi Ji, Gong Zhang, Renhai Chen, Yangqiu Song
Title: AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Abstract:
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 92\% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
中文摘要:AutoSchemaKG是一个完全自主的框架,利用大语言模型从文本中构建知识图谱而无需预定义模式,通过动态模式归纳实现了与人工模式92%的语义对齐,有效增强了大型语言模型的事实性。
English Summary: AutoSchemaKG is a fully autonomous framework that uses large language models to construct knowledge graphs from text without predefined schemas, achieving high schema alignment and enhancing LLM factuality through dynamic schema induction.

Authors:Zifu Wang, Junyi Zhu, Bo Tang, Zhiyu Li, Feiyu Xiong, Jiaqian Yu, Matthew B. Blaschko
Title: Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles
Abstract:
The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL, using jigsaw puzzles as a structured experimental framework. Jigsaw puzzles offer inherent ground truth, adjustable difficulty, and demand complex decision-making, making them ideal for this study. Our research reveals several key findings: \textit{Firstly,} we find that MLLMs, initially performing near random guessing on the simplest jigsaw puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. \textit{Secondly,} training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. \textit{Thirdly,} MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering; consequently, even when trained for step-by-step reasoning, they can ignore the thinking process when deriving the final answer. \textit{Fourthly,} we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. \textit{Finally,} our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and that an initial SFT cold-start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.
中文摘要:本研究以拼图游戏为实验框架探索基于规则的多模态大语言模型强化学习,发现微调能使模型从接近随机猜测提升至近乎完美的准确率并泛化至复杂配置,且强化学习比监督微调具有更好的泛化效果。
English Summary: This study explores rule-based reinforcement learning in multimodal large language models using jigsaw puzzles, revealing that fine-tuning enables models to progress from random guessing to near-perfect accuracy and generalize to complex configurations, with RL outperforming supervised fine-tuning in generalization capability.

Authors:Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu
Title: Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
Abstract:
Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: Token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving $6$-$12$ percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving $7$-$11$ percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.
中文摘要:分段策略优化(SPO)是一种新颖的强化学习框架,通过引入分段级优势估计克服了词元级和轨迹级方法的局限,在不依赖评论家模型的情况下实现了更优的推理性能。
English Summary: Segment Policy Optimization (SPO) is a novel reinforcement learning framework that introduces segment-level advantage estimation to overcome the limitations of token-level and trajectory-level methods, achieving superior reasoning performance without requiring a critic model.
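
To make the intermediate granularity concrete, here is a toy sketch of segment-level advantage estimation: partition the response, estimate the value at each segment boundary by Monte Carlo rollouts, and credit each segment with the value difference across it. The fixed-size partition and the stubbed rollout estimator are simplifying assumptions (SPO-chain, for instance, uses cutpoint-based partitioning).

```python
# Toy sketch of segment-level advantage estimation in the spirit of SPO.
import random

def mc_value(prefix_tokens: list[str], n_rollouts: int = 8) -> float:
    """Stub: fraction of rollouts from this prefix reaching a correct answer."""
    random.seed(len(prefix_tokens))  # deterministic toy stand-in
    return sum(random.random() < 0.5 for _ in range(n_rollouts)) / n_rollouts

def segment_advantages(tokens: list[str], seg_len: int = 4) -> list[float]:
    cuts = list(range(0, len(tokens) + 1, seg_len))
    if cuts[-1] != len(tokens):
        cuts.append(len(tokens))
    values = [mc_value(tokens[:c]) for c in cuts]
    # advantage of segment i = V(prefix at segment end) - V(prefix at start)
    return [v1 - v0 for v0, v1 in zip(values, values[1:])]

tokens = "let us add two and three to get five".split()
print(segment_advantages(tokens))  # one advantage per segment
```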

Authors:Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Roy Ka-Wei Lee, Erik Cambria, Ranjan Satapathy
Title: Understanding Refusal in Language Models with Sparse Autoencoders
Abstract:
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as the upstream-downstream relationships among latents and the mechanisms behind adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing the generalization of linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
Chinese: 本研究利用稀疏自编码器识别并验证了对齐语言模型中因果介导拒绝行为的潜在特征,从而实现了对拒绝机制的精细分析及其在提升泛化能力和理解对抗性越狱中的应用。
English: This study uses sparse autoencoders to identify and validate latent features that causally mediate refusal behaviors in aligned language models, enabling fine-grained analysis of refusal mechanisms and their applications in improving generalization and understanding adversarial jailbreaking.
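
Conceptually, the intervention works as below: encode an activation into sparse latent features, zero the feature found to mediate refusal, and decode back before the forward pass continues. The SAE weights here are random placeholders and the feature index is hypothetical; in the paper both are identified empirically.

```python
# Conceptual sketch of an SAE-based feature ablation; placeholder weights.
import torch
import torch.nn as nn

d_model, d_sae = 64, 256
enc = nn.Linear(d_model, d_sae)
dec = nn.Linear(d_sae, d_model)
REFUSAL_FEATURE = 17  # hypothetical index of a refusal-mediating latent

def ablate_refusal(activation: torch.Tensor) -> torch.Tensor:
    latents = torch.relu(enc(activation))  # sparse feature activations
    latents[..., REFUSAL_FEATURE] = 0.0    # causal intervention: ablate
    return dec(latents)                    # reconstructed activation

x = torch.randn(1, d_model)
print(ablate_refusal(x).shape)  # torch.Size([1, 64]); forward pass resumes here
```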

Authors:Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li
Title: Probability-Consistent Preference Optimization for Enhanced LLM Reasoning
Abstract:
Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at https://github.com/YunqiaoYang/PCPO.
中文: 提出的概率一致性偏好优化(PCPO)框架通过结合答案正确性和标记级概率一致性来增强大语言模型的数学推理能力,在多种模型和基准测试中均优于现有方法。
English: The proposed Probability-Consistent Preference Optimization (PCPO) framework enhances mathematical reasoning in LLMs by incorporating both answer correctness and token-level probability consistency, outperforming existing methods across various models and benchmarks.

Authors:Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang
Title: SWE-bench Goes Live!
Abstract:
The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.
中文: SWE-bench-Live作为可实时更新的基准被提出,旨在克服静态基准的局限,通过自动化流程和Docker环境实现可复现评估,用于测试大语言模型处理真实GitHub问题的能力。
English: SWE-bench-Live is introduced as a live-updatable benchmark to address the limitations of static benchmarks like SWE-bench, featuring automated curation and Docker-based reproducibility for evaluating LLMs on real GitHub issues.

Authors:Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao
Title: Discriminative Policy Optimization for Token-Level Reward Models
Abstract:
Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at https://github.com/homzer/Q-RM.
中文: Q-RM模型通过将奖励建模与语言生成解耦来优化细粒度奖励分配,在多项基准测试中均显著超越基线方法,有效提升推理能力和训练效率。
English: The Q-RM model, which decouples reward modeling from language generation to optimize token-level credit assignment, consistently outperforms baseline methods in enhancing reasoning capabilities and training efficiency across various benchmarks.

Authors:Maya Dewhurst, Jack Collins, Justin J. H. Lo, Roy Alderton, Sam Kirkham
Title: Nosey: Open-source hardware for acoustic nasalance
Abstract:
We introduce Nosey (Nasalance Open Source Estimation sYstem), a low-cost, customizable, 3D-printed system for recording acoustic nasalance data that we have made available as open-source hardware (http://github.com/phoneticslab/nosey). We first outline the motivations and design principles behind our hardware nasalance system, and then present a comparison between Nosey and a commercial nasalance device. Nosey shows consistently higher nasalance scores than the commercial device, but the magnitude of contrast between phonological environments is comparable between systems. We also review ways of customizing the hardware to facilitate testing, such as comparison of microphones and different construction materials. We conclude that Nosey is a flexible and cost-effective alternative to commercial nasometry devices and propose some methodological considerations for its use in data collection.
中文: Nosey是一种开源、可3D打印的鼻音度记录系统,作为商业设备的低成本、可定制替代方案,尽管其鼻音度得分整体较高,但在区分语音环境对比方面表现出与商业设备相当的性能。
English: Nosey is an open-source, 3D-printed nasalance recording system that offers a cost-effective and customizable alternative to commercial devices, demonstrating comparable performance in distinguishing phonological contrasts despite higher overall nasalance scores.

Authors:James Xu Zhao, Jimmy Z. J. Liu, Bryan Hooi, See-Kiong Ng
Title: How Does Response Length Affect Long-Form Factuality
Abstract:
Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
中文: 本研究揭示大型语言模型的生成长度与事实准确性呈负相关,主要原因是知识耗尽导致可靠信息逐渐减少,而非错误传播或长上下文问题。
English: This study reveals that longer responses from large language models exhibit lower factual precision due to facts exhaustion, where the model gradually depletes its reliable knowledge, rather than error propagation or long context issues.

Authors:Xinye Li, Zunwen Zheng, Qian Zhang, Dekai Zhuang, Jiabao Kang, Liyan Xu, Qingbin Liu, Xi Chen, Zhiying Tu, Dianhui Chu, Dianbo Sui
Title: ScEdit: Script-based Assessment of Knowledge Editing
Abstract:
Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.
Chinese: 当前知识编辑方法在简单任务上表现优异,但在实际应用中面临挑战,为此我们推出了ScEdit新基准,综合评估事实性和行动性编辑,发现所有方法性能均下降。
English: Current knowledge editing methods perform well on simple tasks but struggle in real-world applications, prompting the introduction of ScEdit, a new benchmark that evaluates both factual and action-based edits and reveals performance drops across all methods.

Authors:Yong Zhang, Yanwen Huang, Ning Cheng, Yang Guo, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao
Title: Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5$\times$ compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.
English Summary: Sentinel introduces a lightweight framework that uses attention signals from a small proxy LLM to compress retrieved passages effectively, achieving high compression ratios while maintaining QA performance without needing dedicated training.
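
A rough sketch of the probing idea, assuming an off-the-shelf ~0.5B proxy model: aggregate decoder attention from the final (query-side) position onto each context sentence and keep the high-mass sentences. The real method trains a lightweight classifier on these signals; a raw attention-mass score and naive token alignment stand in here.

```python
# Hedged sketch of attention probing for sentence-level context filtering.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed off-the-shelf ~0.5B proxy
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_attentions=True)
model.eval()

def sentence_scores(sentences: list[str], query: str) -> list[float]:
    text = " ".join(sentences) + " " + query
    ids = tok(text, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = model(**ids)
    # mean over layers and heads of attention from the last (query) position
    att = torch.stack(out.attentions).mean(dim=(0, 2))[0, -1]  # (seq_len,)
    scores, pos = [], 0
    for s in sentences:  # naive alignment: per-sentence token counts
        n = len(tok(s, add_special_tokens=False)["input_ids"])
        scores.append(att[pos:pos + n].sum().item())
        pos += n
    return scores

sents = ["Paris is the capital of France.", "Bananas are yellow."]
print(sentence_scores(sents, "What is the capital of France?"))
```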

Authors:Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza
Title: Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement
Abstract:
Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
中文: 本研究利用模型可解释性和不确定性量化探索了高效的词级质量评估方法,以检测翻译错误,揭示了无监督指标的潜力及有监督方法在标签不确定性下的局限性。
English: This study explores efficient methods for word-level quality estimation by leveraging model interpretability and uncertainty quantification to detect translation errors, revealing the potential of unsupervised metrics and the limitations of supervised approaches under label uncertainty.

Authors:Wenjing Xing, Wenke Lu, Yeheng Duan, Bing Zhao, Zhenghui kang, Yaolong Wang, Kai Gao, Lei Qiao
Title: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification
Abstract:
Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieved performance comparable to Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in our experiments, including both the unfiltered versions and the versions filtered via static analysis. The data are available at https://github.com/xingwenjing417/Infinite-Instruct-dataset
中文: Infinite-Instruct通过逆向构建和反馈构建自动生成逻辑严密的高质量代码指令数据,仅用少量数据即可大幅提升大语言模型的代码生成能力。
English: Infinite-Instruct is an automated framework that synthesizes high-quality, logically coherent code instruction data through reverse and backfeeding construction, significantly boosting LLMs' code generation performance with minimal data.

Authors:Yuu Jinnai
Title: Document-Level Text Generation with Minimum Bayes Risk Decoding using Optimal Transport
Abstract:
Document-level text generation tasks are known to be more difficult than sentence-level text generation tasks, as they require the understanding of longer context to generate high-quality texts. In this paper, we investigate the adaptation of Minimum Bayes Risk (MBR) decoding for document-level text generation tasks. MBR decoding makes use of a utility function to estimate the output with the highest expected utility from a set of candidate outputs. Although MBR decoding is shown to be effective in a wide range of sentence-level text generation tasks, its performance on document-level text generation tasks is limited, as many utility functions are designed for evaluating the utility of sentences. To this end, we propose MBR-OT, a variant of MBR decoding that uses the Wasserstein distance to compute the utility of a document from a sentence-level utility function. Experimental results show that MBR-OT outperforms standard MBR in document-level machine translation, text simplification, and dense image captioning tasks. Our code is available at https://github.com/jinnaiyuu/mbr-optimal-transport
中文: 本文提出MBR-OT方法,通过引入Wasserstein距离改进最小贝叶斯风险解码,将句子级效用函数有效应用于文档级文本生成,在机器翻译、文本简化和密集图像描述任务中展现出优于标准方法的性能。
English: This paper introduces MBR-OT, an enhanced Minimum Bayes Risk decoding method using Wasserstein distance to improve document-level text generation by effectively applying sentence-level utility functions to longer contexts, demonstrating superior performance in translation, simplification, and captioning tasks.
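
The decision rule can be sketched compactly: lift a sentence-level utility to documents by solving an optimal transport problem between the two sentence sets, then run standard sampling-based MBR over the candidates. The sketch below uses the POT library with uniform sentence weights and a toy token-overlap utility; the paper plugs in proper sentence-level metrics.

```python
# Hedged sketch of MBR decoding with an OT-lifted document utility.
import numpy as np
import ot  # pip install pot

def sent_utility(a: str, b: str) -> float:
    """Toy token-overlap F1 as a stand-in sentence-level utility."""
    ta, tb = set(a.split()), set(b.split())
    return 2 * len(ta & tb) / max(len(ta) + len(tb), 1)

def doc_utility(doc_a: list[str], doc_b: list[str]) -> float:
    cost = np.array([[1 - sent_utility(x, y) for y in doc_b] for x in doc_a])
    w_a = np.full(len(doc_a), 1 / len(doc_a))  # uniform sentence weights
    w_b = np.full(len(doc_b), 1 / len(doc_b))
    return 1 - ot.emd2(w_a, w_b, cost)          # Wasserstein cost -> utility

def mbr_ot(candidates: list[list[str]]) -> list[str]:
    scores = [np.mean([doc_utility(c, o) for o in candidates if o is not c])
              for c in candidates]
    return candidates[int(np.argmax(scores))]

cands = [["the cat sat .", "it was warm ."],
         ["a cat sat .", "it was warm ."],
         ["dogs bark loudly ."]]
print(mbr_ot(cands))  # picks the candidate with highest expected utility
```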

Authors:Tianteng Gu, Bei Liu, Bo Xiao, Ke Zeng, Jiacheng Liu, Yanmin Qian
Title: DenoiseRotator: Enhance Pruning Robustness for LLMs via Importance Concentration
Abstract:
Pruning is a widely used technique to compress large language models (LLMs) by removing unimportant weights, but it often suffers from significant performance degradation - especially under semi-structured sparsity constraints. Existing pruning methods primarily focus on estimating the importance of individual weights, which limits their ability to preserve critical capabilities of the model. In this work, we propose a new perspective: rather than merely selecting which weights to prune, we first redistribute parameter importance to make the model inherently more amenable to pruning. By minimizing the information entropy of normalized importance scores, our approach concentrates importance onto a smaller subset of weights, thereby enhancing pruning robustness. We instantiate this idea through DenoiseRotator, which applies learnable orthogonal transformations to the model's weight matrices. Our method is model-agnostic and can be seamlessly integrated with existing pruning techniques such as Magnitude, SparseGPT, and Wanda. Evaluated on LLaMA3, Qwen2.5, and Mistral models under 50% unstructured and 2:4 semi-structured sparsity, DenoiseRotator consistently improves perplexity and zero-shot accuracy. For instance, on LLaMA3-70B pruned with SparseGPT at 2:4 semi-structured sparsity, DenoiseRotator reduces the perplexity gap to the dense model by 58%, narrowing the degradation from 8.1 to 3.4 points. Codes are available at https://github.com/Axel-gu/DenoiseRotator.
Chinese: 本文提出DenoiseRotator方法,通过正交变换重新分配参数重要性,增强大语言模型剪枝的鲁棒性并显著减少性能损失。
English: This paper introduces DenoiseRotator, a model-agnostic method that redistributes parameter importance through orthogonal transformations to enhance pruning robustness and reduce performance degradation in large language models.
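
The objective in the abstract, concentrating importance by minimizing the entropy of normalized importance scores under an orthogonal transformation, can be sketched directly in PyTorch. Using plain |W| as the importance score and a single one-sided rotation are simplifying assumptions; the method also folds in activation information and is applied per weight matrix.

```python
# Conceptual sketch of entropy-minimizing orthogonal rotation before pruning.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

W = torch.randn(32, 32)                           # frozen pretrained weight (toy)
rot = orthogonal(nn.Linear(32, 32, bias=False))   # learnable orthogonal Q
opt = torch.optim.Adam(rot.parameters(), lr=1e-2)

def importance_entropy(M: torch.Tensor) -> torch.Tensor:
    """Entropy of normalized |weight| scores; lower = more concentrated."""
    p = M.abs().flatten()
    p = p / p.sum()
    return -(p * (p + 1e-12).log()).sum()

for step in range(200):
    opt.zero_grad()
    loss = importance_entropy(W @ rot.weight)     # rotate, then score
    loss.backward()
    opt.step()

print(importance_entropy(W).item(),               # before rotation
      importance_entropy(W @ rot.weight).item())  # after: lower entropy
```

Because the transformation is orthogonal it can be folded into adjacent layers without changing the network's function, which is what makes pruning the rotated weights safe.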

Authors:Si Wu, Sebastian Bruch
Title: Uncovering Visual-Semantic Psycholinguistic Properties from the Distributional Structure of Text Embedding Space
Abstract:
Imageability (potential of text to evoke a mental image) and concreteness (perceptibility of text) are two psycholinguistic properties that link visual and semantic spaces. It is little surprise that computational methods that estimate them do so using parallel visual and semantic spaces, such as collections of image-caption pairs or multi-modal models. In this paper, we work on the supposition that text itself in an image-caption dataset offers sufficient signals to accurately estimate these properties. We hypothesize, in particular, that the peakedness of the neighborhood of a word in the semantic embedding space reflects its degree of imageability and concreteness. We then propose an unsupervised, distribution-free measure, which we call Neighborhood Stability Measure (NSM), that quantifies the sharpness of peaks. Extensive experiments show that NSM correlates more strongly with ground-truth ratings than existing unsupervised methods, and is a strong predictor of these properties for classification. Our code and data are available on GitHub (https://github.com/Artificial-Memory-Lab/imageability).
Chinese: 本文提出一种无监督的邻域稳定性度量方法,通过分析语义嵌入空间中词汇分布的峰值特征来有效评估形象性和具体性,其与人工评分的相关性优于现有方法。
English: This paper introduces an unsupervised Neighborhood Stability Measure (NSM) that effectively estimates imageability and concreteness by analyzing the peakedness of words in semantic embedding space, outperforming existing methods in correlation with human ratings.
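
One plausible instantiation of the peakedness intuition, offered only as an illustration since the paper's exact NSM definition differs: score a word by how much its nearest neighbors stand out from the wider neighborhood in cosine similarity.

```python
# Illustrative neighborhood-peakedness score (not the paper's exact NSM).
import numpy as np

def peakedness(word_vec: np.ndarray, all_vecs: np.ndarray,
               k_near: int = 5, k_far: int = 50) -> float:
    sims = all_vecs @ word_vec / (
        np.linalg.norm(all_vecs, axis=1) * np.linalg.norm(word_vec) + 1e-9)
    sims = np.sort(sims)[::-1][1:]      # drop self-similarity
    # sharp peak: near neighbors much closer than the wider neighborhood
    return sims[:k_near].mean() - sims[k_near:k_far].mean()

rng = np.random.default_rng(1)
vocab = rng.normal(size=(1000, 64))
vocab /= np.linalg.norm(vocab, axis=1, keepdims=True)
print(peakedness(vocab[0], vocab))      # higher = sharper neighborhood
```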

Authors:Haewon Park, Gyubin Choi, Minjun Kim, Yohan Jo
Title: Context-Robust Knowledge Editing for Language Models
Abstract:
Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge, without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED, a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that existing KE methods often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in the hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.
Chinese: 知识编辑方法在前文语境触发原有知识时常常失效,为此我们开发了CHED基准评估语境鲁棒性,并提出了CoRE方法通过减少隐藏状态中的语境敏感方差来提高编辑成功率。
English: Knowledge editing methods often fail when preceding contexts trigger original knowledge, so we developed the CHED benchmark to evaluate context robustness and introduced the CoRE method to improve editing success by reducing context-sensitive variance in hidden states.

Authors:Peixuan Han, Zijia Liu, Jiaxuan You
Title: ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind
Abstract:
Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and use it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, such as GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: https://github.com/ulab-uiuc/ToMAP.
Chinese: 针对当前大型语言模型在说服任务中缺乏心理理论推理能力的问题,研究者提出了ToMAP方法,通过心理理论模块和强化学习增强对对手心理状态的分析,仅用30亿参数就在多项指标上显著超越了GPT-4o等更大模型。
English: To address the limitations of current LLMs in Theory of Mind reasoning for persuasion, the researchers developed ToMAP, a method that enhances opponent awareness through specialized modules and reinforcement learning, achieving superior performance over larger models like GPT-4o with only 3B parameters.

Authors:Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy
Title: NegVQA: Can Vision Language Models Understand Negation?
Abstract:
Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.
中文: NegVQA基准测试表明视觉语言模型在理解否定语义方面存在显著困难,不仅表现大幅下降,还呈现出模型规模与性能间的U型缩放规律。
English: The NegVQA benchmark reveals that vision language models struggle significantly with understanding negation, showing performance drops and a U-shaped scaling trend with model size increases.

Authors:Jirui Qi, Shan Chen, Zidi Xiong, Raquel Fernández, Danielle S. Bitterman, Arianna Bisazza
Title: When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy
Abstract:
Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at https://github.com/Betswish/mCoT-XReasoning.
中文摘要:近期大型推理模型在英语任务中表现优异,但在多语言推理方面存在明显不足,常出现语言回退或逻辑碎片化问题,需在答案准确性与推理可读性之间权衡,而针对性训练可部分缓解这一矛盾。
English Summary: Recent large reasoning models demonstrate strong performance in English but struggle with multilingual reasoning, often reverting to English or producing fragmented logic in other languages, revealing a significant capability gap that requires balancing between answer accuracy and reasoning trace readability.

Authors:Iknoor Singh, Carolina Scarton, Kalina Bontcheva
Title: GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification
Abstract:
The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automatic data analysis. Narrative classification is emerging as an important task, since identifying what is being said online is critical for fact-checkers, policy makers, and other professionals working in information studies. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a pre-defined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step Large Language Model (LLM) prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. The code is available at https://github.com/GateNLP/H3Prompt.
中文摘要:本文提出的H3Prompt方法采用分层三步提示策略,利用大语言模型对多语言新闻进行叙事分类,在SemEval 2025评测中荣获全球第一名。
English Summary: This paper introduces H3Prompt, a hierarchical three-step prompting method using Large Language Models for multilingual narrative classification of news articles, which achieved first place in the SemEval 2025 competition.
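
The three-step flow can be sketched as chained LLM calls. This is a hedged illustration: the taxonomy layout and prompt wording are assumptions, and `call_llm` stands in for any chat-completion client, not the authors' released templates.

```python
# Minimal sketch of a hierarchical three-step prompting pipeline.
def h3prompt_classify(article: str, taxonomy: dict, call_llm) -> dict:
    # Step 1: domain routing.
    domain = call_llm(
        "Classify this article as 'Ukraine-Russia War' or 'Climate Change'.\n"
        f"Article: {article}"
    ).strip()
    # Step 2: main narratives within the chosen domain.
    mains = call_llm(
        f"Domain: {domain}. Pick the most relevant main narratives from "
        f"{list(taxonomy[domain])}.\nArticle: {article}"
    )
    # Step 3: sub-narratives under the chosen main narratives.
    subs = call_llm(
        f"Main narratives: {mains}. Assign sub-narratives from "
        f"{taxonomy[domain]}.\nArticle: {article}"
    )
    return {"domain": domain, "main_narratives": mains, "sub_narratives": subs}
```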

Authors:Yupei Li, Shuaijie Shao, Manuel Milling, Björn W. Schuller
Title: Large Language Models for Depression Recognition in Spoken Language Integrating Psychological Knowledge
Abstract:
Depression is a growing concern gaining attention in both public discourse and AI research. While deep neural networks (DNNs) have been used for recognition, they still lack real-world effectiveness. Large language models (LLMs) show strong potential but require domain-specific fine-tuning and struggle with non-textual cues. Since depression is often expressed through vocal tone and behaviour rather than explicit text, relying on language alone is insufficient. Diagnostic accuracy also suffers without incorporating psychological expertise. To address these limitations, we present, to the best of our knowledge, the first application of LLMs to multimodal depression detection using the DAIC-WOZ dataset. We extract audio features using the pre-trained model Wav2Vec and map them to text-based LLMs for further processing. We also propose a novel strategy for incorporating psychological knowledge into LLMs to enhance diagnostic performance, specifically using a question-and-answer set to grant authorised knowledge to LLMs. Our approach yields a notable improvement in both Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) compared to the baseline score reported in the original paper. The code is available at https://github.com/myxp-lyp/Depression-detection.git
中文摘要:本研究首次将大语言模型应用于多模态抑郁症检测,通过结合Wav2Vec音频特征与心理学知识增强策略,在诊断准确性上较基线分数实现了显著提升。
English Summary: This study introduces the first multimodal depression detection method using large language models (LLMs) combined with audio features from Wav2Vec and psychological knowledge integration, achieving significant improvements in diagnostic accuracy over baseline scores.
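
The audio-to-LLM bridge can be sketched in a few lines, assuming the Hugging Face `Wav2Vec2Model` as the feature extractor and a learned linear projection into an assumed 4096-dimensional LLM embedding space; the paper's exact mapping may differ.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

# Frozen acoustic encoder; model name is a common public checkpoint.
audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
proj = nn.Linear(audio_encoder.config.hidden_size, 4096)  # 4096 = assumed LLM dim

waveform = torch.randn(1, 16000)          # 1 s of 16 kHz audio (placeholder)
with torch.no_grad():
    feats = audio_encoder(waveform).last_hidden_state      # (1, T, 768)
audio_tokens = proj(feats)                # (1, T, 4096): fed to the LLM as soft tokens
```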

Authors:Andrew Zhu, Evan Osgood, Chris Callison-Burch
Title: First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay
Abstract:
Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.
中文: 本文提出了“旁听智能体”的新范式,通过《龙与地下城》案例展示了多模态音频语言模型如何被动监听人类对话以执行后台任务或提供辅助,并开源相关代码以推动该领域研究。
English: This paper introduces "overhearing agents," a novel paradigm where LLM agents passively monitor human conversations to perform background tasks or offer assistance, demonstrated through a Dungeons & Dragons case study using multimodal audio-language models and released with open-source code for further research.

Authors:Tian Qin, Core Francisco Park, Mujin Kwun, Aaron Walsman, Eran Malach, Nikhil Anand, Hidenori Tanaka, David Alvarez-Melis
Title: Decomposing Elements of Problem Solving: What "Math" Does RL Teach?
Abstract:
Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming that RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.
中文摘要:强化学习方法如GRPO主要提升了大型语言模型在数学推理中的执行稳健性,但由于规划能力不足而面临“覆盖墙”的局限,不过控制实验表明通过改进探索机制可能找到突破这一障碍的潜在路径。
English Summary: Reinforcement learning methods like GRPO primarily enhance LLMs' execution robustness in mathematical reasoning but face a 'coverage wall' due to insufficient planning skills, though controlled experiments suggest potential pathways to overcome this limitation through improved exploration.

Authors:Rafik Mankour, Yassine Chafai, Hamada Saleh, Ghassen Ben Hassine, Thibaud Barreau, Peter Tankov
Title: Climate Finance Bench
Abstract:
Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we propose a comparison of RAG (retrieval-augmented generation) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting advantages of techniques such as Weight Quantization.
中文摘要:Climate Finance Bench推出一个针对企业气候披露的开放问答基准,通过专家验证数据集揭示检索准确性是主要性能瓶颈,并倡导在气候AI应用中采用量化权重等透明碳报告技术。
English Summary: Climate Finance Bench introduces an open benchmark for evaluating LLM-based question-answering on corporate climate reports, identifying retrieval accuracy as the key performance bottleneck while advocating for transparent carbon reporting in AI applications.
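
Since the abstract pinpoints the retriever as the chief bottleneck, a natural diagnostic is retrieval hit rate: how often the gold answer string appears in the retrieved chunks. A minimal sketch, where `retrieve` is a hypothetical stand-in for any RAG retriever:

```python
def hit_rate(questions, gold_answers, retrieve, k: int = 5) -> float:
    """Fraction of questions where some top-k chunk contains the gold answer."""
    hits = 0
    for q, gold in zip(questions, gold_answers):
        chunks = retrieve(q, top_k=k)        # plug in your retriever here
        hits += any(gold.lower() in c.lower() for c in chunks)
    return hits / len(questions)
```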

Authors:Ce Zhang, Kaixin Ma, Tianqing Fang, Wenhao Yu, Hongming Zhang, Zhisong Zhang, Yaqi Xie, Katia Sycara, Haitao Mi, Dong Yu
Title: VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models
Abstract:
Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-arts on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4\% of the original performance. Code is available at https://github.com/Tencent/SelfEvolvingAgent/tree/main/VScan.
中文: 近期大型视觉语言模型因冗长视觉标记导致计算效率低下,为此提出VScan框架,通过整合全局-局部扫描与中间层剪枝的两阶段方法减少标记冗余,在加速推理的同时保持高性能。
English: Recent Large Vision-Language Models face computational inefficiency from lengthy visual tokens, prompting the development of VScan, a two-stage framework that reduces token redundancy through integrated global-local scans and intermediate layer pruning to accelerate inference while maintaining high performance.
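
To make the token-reduction idea concrete, here is a generic sketch of attention-guided pruning with merging: keep the tokens most attended by a [CLS] query and fold the rest into their nearest kept token. This is a simplified stand-in, not VScan's exact two-stage global-local algorithm.

```python
import torch
import torch.nn.functional as F

def reduce_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep: int):
    """tokens: (N, D) visual tokens; cls_attn: (N,) attention received from [CLS]."""
    keep_idx = cls_attn.topk(keep).indices
    keep_set = set(keep_idx.tolist())
    drop_idx = torch.tensor([i for i in range(len(tokens)) if i not in keep_set])
    kept, dropped = tokens[keep_idx].clone(), tokens[drop_idx]
    # Merge each dropped token into its most similar kept token (running mean).
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T
    assign = sim.argmax(dim=-1)              # nearest kept token per dropped token
    for j in range(keep):
        group = dropped[assign == j]
        if len(group):
            kept[j] = (kept[j] + group.mean(dim=0)) / 2
    return kept                              # (keep, D) reduced token set

reduced = reduce_tokens(torch.randn(576, 1024), torch.rand(576), keep=144)
print(reduced.shape)   # torch.Size([144, 1024])
```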

Authors:Ang Lv, Ruobing Xie, Xingwu Sun, Zhanhui Kang, Rui Yan
Title: The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason
Abstract:
Recent studies on post-training large language models (LLMs) for reasoning through reinforcement learning (RL) typically focus on tasks that can be accurately verified and rewarded, such as solving math problems. In contrast, our research investigates the impact of reward noise, a more practical consideration for real-world scenarios involving the post-training of LLMs using reward models. We found that LLMs demonstrate strong robustness to substantial reward noise. For example, manually flipping 40% of the reward function's outputs in math tasks still allows a Qwen-2.5-7B model to achieve rapid convergence, improving its performance on math tasks from 5% to 72%, compared to the 75% accuracy achieved by a model trained with noiseless rewards. Surprisingly, by only rewarding the appearance of key reasoning phrases (namely reasoning pattern reward, RPR), such as ``first, I need to''-without verifying the correctness of answers, the model achieved peak downstream performance (over 70% accuracy for Qwen-2.5-7B) comparable to models trained with strict correctness verification and accurate rewards. Recognizing the importance of the reasoning process over the final results, we combined RPR with noisy reward models. RPR helped calibrate the noisy reward models, mitigating potential false negatives and enhancing the LLM's performance on open-ended tasks. These findings suggest the importance of improving models' foundational abilities during the pre-training phase while providing insights for advancing post-training techniques. Our code and scripts are available at https://github.com/trestad/Noisy-Rewards-in-Learning-to-Reason.
中文: 该研究发现大语言模型对奖励噪声具有强鲁棒性,通过仅奖励关键推理模式而不验证答案正确性即可达到相近性能,为改进预训练与后训练技术提供了新思路。
English: This study reveals that large language models exhibit strong robustness to reward noise, achieving comparable performance through reasoning pattern rewards without strict correctness verification, and provides insights for enhancing both pre-training and post-training techniques.
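
Both ingredients of the study are easy to sketch: a reasoning pattern reward (RPR) that only checks for key phrases, and a noisy verifiable reward that flips with some probability. The phrase list and weighting are illustrative assumptions.

```python
import random

# RPR: reward the presence of reasoning phrases, with no correctness check.
REASONING_PHRASES = ["first, i need to", "let me check", "therefore", "step"]

def rpr_reward(response: str) -> float:
    text = response.lower()
    hits = sum(phrase in text for phrase in REASONING_PHRASES)
    return hits / len(REASONING_PHRASES)     # in [0, 1]

# Simulating the noisy-reward experiment: flip a verifiable 0/1 reward w.p. p.
def noisy_reward(correct: bool, flip_p: float = 0.4) -> float:
    r = float(correct)
    return 1.0 - r if random.random() < flip_p else r
```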

Authors:Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Gang Fu, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
Title: WebDancer: Towards Autonomous Information Seeking Agency
Abstract:
Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. Recent progress in agentic systems, exemplified by Deep Research, underscores the potential for autonomous multi-step research. In this work, we present a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. Our approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for an effective cold start, and (4) reinforcement learning for enhanced generalisation. We instantiate this framework in WebDancer, a web agent built on ReAct. Empirical evaluations on the challenging information-seeking benchmarks GAIA and WebWalkerQA demonstrate the strong performance of WebDancer, achieving considerable results and highlighting the efficacy of our training paradigm. Further analysis of agent training provides valuable insights and actionable, systematic pathways for developing more capable agentic models. The code and demo will be released at https://github.com/Alibaba-NLP/WebAgent.
中文: 本文提出了一种构建端到端自主信息检索智能体的完整训练范式,通过四阶段训练流程在基准测试中表现出色,为开发更强大的智能体模型提供了系统路径。
English: This paper introduces a comprehensive training framework for developing autonomous information-seeking agents, which achieves strong performance on benchmarks through a four-stage process including data construction and reinforcement learning.
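
WebDancer is instantiated on ReAct, whose core loop is simple to sketch: the policy alternates thoughts and tool calls until it emits a final answer. `llm` and `search` below are hypothetical stand-ins for the trained policy and the search tool, and the tags and stop conditions are illustrative.

```python
def react_agent(question: str, llm, search, max_steps: int = 8) -> str:
    """Minimal ReAct-style thought/action/observation loop."""
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trace + "Thought:")       # model writes thought + action
        trace += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Search[" in step:                # assumed action syntax
            query = step.split("Search[")[1].split("]")[0]
            trace += f"Observation: {search(query)}\n"
    return "no answer within budget"
```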

Authors:Hanjia Lyu, Jiebo Luo, Jian Kang, Allison Koenecke
Title: Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese
Abstract:
While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it is yet unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring. To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese). For both tasks, we audit the performance of 11 leading commercial LLM services and open-sourced models -- spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese. These findings highlight the need for further analysis of LLM biases; as such, we provide an open-sourced benchmark dataset to foster reproducible evaluations of future LLM behavior across Chinese language variants (https://github.com/brucelyu17/SC-TC-Bench).
中文:本研究揭示了大型语言模型在处理简体与繁体中文时存在任务依赖性的性能差异,这些偏差受训练数据和语言特征影响,并提供了开源基准以促进未来跨中文变体的模型评估。
English: This study investigates performance disparities in Large Language Models (LLMs) when processing Simplified versus Traditional Chinese, revealing task-dependent biases influenced by training data and linguistic differences, and provides an open-source benchmark for future evaluations.

Authors:Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun
Title: RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction
Abstract:
Image recaptioning is widely used to generate training datasets with enhanced quality for various multimodal tasks. Existing recaptioning methods typically rely on powerful multimodal large language models (MLLMs) to enhance textual descriptions, but often suffer from inaccuracies due to hallucinations and incompleteness caused by missing fine-grained details. To address these limitations, we propose RICO, a novel framework that refines captions through visual reconstruction. Specifically, we leverage a text-to-image model to reconstruct a caption into a reference image, and prompt an MLLM to identify discrepancies between the original and reconstructed images to refine the caption. This process is performed iteratively, progressively promoting the generation of more faithful and comprehensive descriptions. To mitigate the additional computational cost induced by the iterative process, we introduce RICO-Flash, which learns to generate captions like RICO using DPO. Extensive experiments demonstrate that our approach significantly improves caption accuracy and completeness, outperforming most baselines by approximately 10% on both CapsBench and CompreCap. Code is released at https://github.com/wangyuchi369/RICO.
中文: RICO框架通过将描述重构为参考图像并识别差异来迭代优化图像描述,提升准确性和完整性,同时RICO-Flash利用DPO提高效率,在多个基准测试中显著优于现有方法。
English: The proposed RICO framework iteratively refines image captions by reconstructing them into reference images and identifying discrepancies to enhance accuracy and completeness, with RICO-Flash optimizing efficiency through DPO, achieving significant improvements over existing methods.
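
The reconstruct-compare-rewrite loop can be sketched directly from the description. `t2i`, `mllm_compare`, and `mllm_rewrite` are hypothetical interfaces for the text-to-image model and the MLLM, not the released API.

```python
def rico_refine(image, caption: str, t2i, mllm_compare, mllm_rewrite,
                rounds: int = 3) -> str:
    """Iteratively refine a caption via visual reconstruction."""
    for _ in range(rounds):
        recon = t2i(caption)                   # caption -> reference image
        diffs = mllm_compare(image, recon)     # textual list of discrepancies
        if not diffs:
            break                              # reconstruction matches: stop early
        caption = mllm_rewrite(caption, diffs) # fix the caption accordingly
    return caption
```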

Authors:Ethan Chern, Zhulin Hu, Steffi Chern, Siqi Kou, Jiadi Su, Yan Ma, Zhijie Deng, Pengfei Liu
Title: Thinking with Generated Images
Abstract:
We present Thinking with Generated Images, a novel paradigm that fundamentally transforms how large multimodal models (LMMs) engage with visual reasoning by enabling them to natively think across text and vision modalities through spontaneous generation of intermediate visual thinking steps. Current visual reasoning with LMMs is constrained to either processing fixed user-provided images or reasoning solely through text-based chain-of-thought (CoT). Thinking with Generated Images unlocks a new dimension of cognitive capability where models can actively construct intermediate visual thoughts, critique their own visual hypotheses, and refine them as integral components of their reasoning process. We demonstrate the effectiveness of our approach through two complementary mechanisms: (1) vision generation with intermediate visual subgoals, where models decompose complex visual tasks into manageable components that are generated and integrated progressively, and (2) vision generation with self-critique, where models generate an initial visual hypothesis, analyze its shortcomings through textual reasoning, and produce refined outputs based on their own critiques. Our experiments on vision generation benchmarks show substantial improvements over baseline approaches, with our models achieving up to 50% (from 38% to 57%) relative improvement in handling complex multi-object scenarios. From biochemists exploring novel protein structures, and architects iterating on spatial designs, to forensic analysts reconstructing crime scenes, and basketball players envisioning strategic plays, our approach enables AI models to engage in the kind of visual imagination and iterative refinement that characterizes human creative, analytical, and strategic thinking. We release our open-source suite at https://github.com/GAIR-NLP/thinking-with-generated-images.
中文摘要:本文提出“生成图像思维”新范式,通过让大型多模态模型自主生成并优化中间视觉思维步骤,从根本上改变了视觉推理方式,在复杂场景中实现高达50%的相对性能提升。
English Summary: This paper introduces "Thinking with Generated Images," a paradigm that enhances large multimodal models' visual reasoning by enabling them to spontaneously generate and refine intermediate visual thoughts, achieving up to 50% relative improvement in complex scenarios.

Authors:Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun
Title: Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Abstract:
Improving Multi-modal Large Language Models (MLLMs) in the post-training stage typically relies on supervised fine-tuning (SFT) or reinforcement learning (RL). However, these supervised methods require expensive and manually annotated multi-modal data--an ultimately unsustainable resource. While recent efforts have explored unsupervised post-training, their methods are complex and difficult to iterate. In this work, we are the first to investigate the use of GRPO, a stable and scalable online RL algorithm, for enabling continual self-improvement without any external supervision. We propose MM-UPT, a simple yet effective framework for unsupervised post-training of MLLMs. MM-UPT builds upon GRPO, replacing traditional reward signals with a self-rewarding mechanism based on majority voting over multiple sampled responses. Our experiments demonstrate that MM-UPT significantly improves the reasoning ability of Qwen2.5-VL-7B (e.g., 66.3 %$\rightarrow$72.9 % on MathVista, 62.9 %$\rightarrow$68.7 % on We-Math), using standard datasets without ground-truth labels. MM-UPT also outperforms prior unsupervised baselines and even approaches the results of supervised GRPO. Furthermore, we show that incorporating synthetic questions, generated solely by the MLLM itself, can boost performance as well, highlighting a promising approach for scalable self-improvement. Overall, MM-UPT offers a new paradigm for continual, autonomous enhancement of MLLMs in the absence of external supervision. Our code is available at https://github.com/waltonfuture/MM-UPT.
中文: MM-UPT提出了一种基于GRPO和无监督自奖励机制的后训练框架,无需外部监督即可显著提升多模态大语言模型的推理能力,其效果甚至接近有监督方法。
English: MM-UPT introduces an unsupervised post-training framework using GRPO with a self-rewarding mechanism, significantly enhancing MLLMs' reasoning without external supervision and even approaching supervised method performance.
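
The self-rewarding mechanism is essentially majority voting over sampled responses; a minimal sketch (with a deliberately naive answer extractor) follows.

```python
from collections import Counter

def extract_answer(response: str) -> str:
    return response.strip().splitlines()[-1]   # naive: assume answer on last line

def self_rewards(responses: list[str]) -> list[float]:
    """Majority answer becomes the pseudo-label; agreement earns reward 1."""
    answers = [extract_answer(r) for r in responses]
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]   # GRPO group rewards

print(self_rewards(["...\n42", "...\n42", "...\n17"]))   # [1.0, 1.0, 0.0]
```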

Authors:Yao Huang, Huanran Chen, Shouwei Ruan, Yichi Zhang, Xingxing Wei, Yinpeng Dong
Title: Mitigating Overthinking in Large Reasoning Models via Manifold Steering
Abstract:
Recent advances in Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex tasks such as mathematics and coding. However, these models frequently exhibit a phenomenon known as overthinking during inference, characterized by excessive validation loops and redundant deliberation, leading to substantial computational overhead. In this paper, we aim to mitigate overthinking by investigating the underlying mechanisms from the perspective of mechanistic interpretability. We first show that the tendency to overthink can be effectively captured by a single direction in the model's activation space, and that the issue can be eased by intervening on the activations along this direction. However, this efficacy soon reaches a plateau and even deteriorates as the intervention strength increases. We therefore systematically explore the activation space and find that the overthinking phenomenon is actually tied to a low-dimensional manifold, which indicates that the limited effect stems from the noise introduced by the high-dimensional steering direction. Based on this insight, we propose Manifold Steering, a novel approach that elegantly projects the steering direction onto the low-dimensional activation manifold given the theoretical approximation of the interference noise. Extensive experiments on DeepSeek-R1 distilled models validate that our method reduces output tokens by up to 71% while maintaining and even improving accuracy on several mathematical benchmarks. Our method also exhibits robust cross-domain transferability, delivering consistent token reduction performance in code generation and knowledge-based QA tasks. Code is available at: https://github.com/Aries-iai/Manifold_Steering.
中文: 本文提出流形导向方法,通过将激活干预投影到低维流形上减少大型推理模型的计算开销,在数学和编程任务中最多减少71%的生成标记数,同时保持甚至提高准确性。
English: This paper introduces Manifold Steering, a method that reduces computational overhead in Large Reasoning Models by projecting activation interventions onto a low-dimensional manifold, achieving up to 71% fewer tokens while maintaining or improving accuracy across mathematical and coding tasks.
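
A rough sketch of the steering idea: estimate a direction from activation differences between overthinking and concise traces, project it onto a low-rank basis of the activation space, and subtract it. The rank, scaling, and SVD-based basis below are assumptions; the paper derives its projection from a theoretical noise approximation.

```python
import numpy as np

def manifold_steer(acts: np.ndarray, overthink_acts: np.ndarray,
                   concise_acts: np.ndarray, rank: int = 8, alpha: float = 1.0):
    """acts: (B, D) activations to steer; the other two are calibration sets."""
    direction = overthink_acts.mean(0) - concise_acts.mean(0)   # raw direction
    centered = acts - acts.mean(0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]                                # low-dimensional manifold basis
    proj_dir = basis.T @ (basis @ direction)         # denoised steering direction
    return acts - alpha * proj_dir                   # steer away from overthinking

steered = manifold_steer(np.random.randn(32, 512),
                         np.random.randn(100, 512), np.random.randn(100, 512))
```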

Authors:Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Title: Text2Grad: Reinforcement Learning from Natural Language Feedback
Abstract:
Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level reward on answer while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results demonstrate that natural-language feedback, when converted to gradients, is a powerful signal for fine-grained policy optimization. The code for our method is available at https://github.com/microsoft/Text2Grad
中文摘要:Text2Grad提出了一种创新的强化学习方法,将自由形式的文本反馈转化为精确的片段级梯度,在多个任务中超越传统标量奖励方法的同时,实现了细粒度的模型优化并增强了可解释性。
English Summary: Text2Grad introduces a novel reinforcement learning approach that converts free-form textual critiques into precise span-level gradients, enabling fine-grained model optimization that outperforms traditional scalar-reward methods across multiple tasks while providing enhanced interpretability.
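
The span-level reward construction can be sketched as overlap between feedback spans and token character offsets; aligning by character offsets is a simplifying assumption, not the released annotation pipeline.

```python
import torch

def span_rewards(n_tokens: int, offsets: list[tuple[int, int]],
                 feedback_spans: list[tuple[int, int, float]]) -> torch.Tensor:
    """offsets: (start, end) chars per token; feedback_spans: (start, end, reward)."""
    r = torch.zeros(n_tokens)
    for s, e, reward in feedback_spans:
        for i, (ts, te) in enumerate(offsets):
            if ts < e and te > s:          # token overlaps the feedback span
                r[i] = reward
    return r   # multiply with per-token log-probs in the policy-gradient loss
```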

Authors:Lai Wei, Yuting Li, Kaipeng Zheng, Chen Wang, Yue Wang, Linghe Kong, Lichao Sun, Weiran Huang
Title: Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start
Abstract:
Recent advancements in large language models (LLMs) have demonstrated impressive chain-of-thought reasoning capabilities, with reinforcement learning (RL) playing a crucial role in this progress. While "aha moment" patterns--where models exhibit self-correction through reflection--are often attributed to emergent properties from RL, we first demonstrate that these patterns exist in multimodal LLMs (MLLMs) prior to RL training but may not necessarily correlate with improved reasoning performance. Building on these insights, we present a comprehensive study on enhancing multimodal reasoning through a two-stage approach: (1) supervised fine-tuning (SFT) as a cold start with structured chain-of-thought reasoning patterns, followed by (2) reinforcement learning via GRPO to further refine these capabilities. Our extensive experiments show that this combined approach consistently outperforms both SFT-only and RL-only methods across challenging multimodal reasoning benchmarks. The resulting models achieve state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with our 7B model showing substantial improvements over base models (e.g., 66.3 %$\rightarrow$73.4 % on MathVista, 62.9 %$\rightarrow$70.4 % on We-Math) and our 3B model achieving performance competitive with several 7B models. Overall, this work provides practical guidance for building advanced multimodal reasoning models. Our code is available at https://github.com/waltonfuture/RL-with-Cold-Start.
中文: 本研究表明多模态大语言模型中的自我修正模式在强化学习训练前就已存在,并提出结合监督微调与强化学习的两阶段方法,在推理基准测试中实现了最优性能。
English: This study demonstrates that self-correction patterns in multimodal LLMs exist before RL training and proposes a two-stage approach combining supervised fine-tuning and reinforcement learning, achieving state-of-the-art performance on reasoning benchmarks.

Authors:Haosheng Zou, Xiaowei Lv, Shousheng Jia, Xiangzheng Zhang
Title: 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training
Abstract:
Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and is used in models such as Light-R1 (arXiv:2503.10460), TinyR1 (arXiv:2503.04872), and Kaggle AIMO math models, as well as in large companies' training frameworks. This technical report delves deeper into the different sequence-parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.
Chinese: 我们开源了集成序列并行技术的360-LLaMA-Factory,该框架已在多个模型及企业训练系统中获得广泛应用。
English: We have open-sourced 360-LLaMA-Factory with sequence parallelism, which has been widely adopted in various models and corporate training frameworks.

Authors:Xuchen Ma, Jianxiang Yu, Wenming Shao, Bo Pang, Xiang Li
Title: Breaking the Cloak! Unveiling Chinese Cloaked Toxicity with Homophone Graph and Toxic Lexicon
Abstract:
Social media platforms have experienced a significant rise in toxic content, including abusive language and discriminatory remarks, presenting growing challenges for content moderation. Some users evade censorship by deliberately disguising toxic words through homophonic cloaks, which necessitates the task of unveiling cloaked toxicity. Existing methods are mostly designed for English texts, while Chinese cloaked toxicity unveiling has not been solved yet. To tackle the issue, we propose C$^2$TU, a novel training-free and prompt-free method for Chinese cloaked toxic content unveiling. It first employs substring matching to identify candidate toxic words based on a Chinese homophone graph and toxic lexicon. Then it filters out non-toxic candidates and corrects the cloaked words to their original toxic forms. Specifically, we develop two model variants for filtering, based on BERT and LLMs, respectively. For LLMs, we address the auto-regressive limitation in computing word occurrence probability and utilize the full semantic context of a text sequence to reveal cloaked toxic words. Extensive experiments demonstrate that C$^2$TU achieves superior performance on two Chinese toxic datasets. In particular, our method outperforms the best competitor by up to 71% in F1 score and 35% in accuracy. Our code and data are available at https://github.com/XDxc-cuber/C2TU-Chinese-cloaked-toxicity-unveiling.
中文摘要:本研究提出C$^2$TU方法,通过识别中文同音替换词并结合BERT与大语言模型过滤非毒性内容,有效检测中文伪装毒性文本,在多项指标上显著超越现有最佳方法。
English Summary: The study introduces C$^2$TU, a training-free method for detecting disguised toxic content in Chinese text by identifying homophonic substitutions and filtering non-toxic candidates using BERT and LLMs, achieving significant performance improvements over existing approaches.
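
The first stage, candidate matching against a homophone table and toxic lexicon, can be sketched with a toy example; the lexicon and homophone entries below are illustrative placeholders, not the paper's resources.

```python
# Toy homophone map (toxic char -> possible cloak chars) and toxic lexicon.
HOMOPHONES = {"傻": {"煞", "鲨"}, "逼": {"比", "币"}}
TOXIC_LEXICON = {"傻逼"}

def candidate_toxic_spans(text: str):
    """Flag substrings matching a lexicon entry directly or via homophones."""
    spans = []
    for word in TOXIC_LEXICON:
        n = len(word)
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            if all(c == w or c in HOMOPHONES.get(w, set())
                   for c, w in zip(window, word)):
                spans.append((i, window, word))   # (pos, cloaked, canonical)
    return spans

print(candidate_toxic_spans("你这个煞比真烦"))   # [(3, '煞比', '傻逼')]
```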

Authors:Yudi Zhang, Weilin Zhao, Xu Han, Tiejun Zhao, Wang Xu, Hailong Cao, Conghui Zhu
Title: Speculative Decoding Meets Quantization: Compatibility Evaluation and Hierarchical Framework Design
Abstract:
Speculative decoding and quantization effectively accelerate memory-bound inference of large language models. Speculative decoding mitigates the memory bandwidth bottleneck by verifying multiple tokens within a single forward pass, which increases computational effort. Quantization achieves this optimization by compressing weights and activations into lower bit-widths and also reduces computations via low-bit matrix multiplications. To further leverage their strengths, we investigate the integration of these two techniques. Surprisingly, experiments applying the advanced speculative decoding method EAGLE-2 to various quantized models reveal that the memory benefits from 4-bit weight quantization are diminished by the computational load from speculative decoding. Specifically, verifying a tree-style draft incurs significantly more time overhead than a single-token forward pass on 4-bit weight quantized models. This finding led to our new speculative decoding design: a hierarchical framework that employs a small model as an intermediate stage to turn tree-style drafts into sequence drafts, leveraging the memory access benefits of the target quantized model. Experimental results show that our hierarchical approach achieves a 2.78$\times$ speedup across various tasks for the 4-bit weight Llama-3-70B model on an A100 GPU, outperforming EAGLE-2 by 1.31$\times$. Code available at https://github.com/AI9Stars/SpecMQuant.
中文总结:推测解码与量化相结合可加速大语言模型推理,但实验发现4位量化的内存优势会被推测解码的计算负载抵消,因此提出分层框架,通过小模型将树状草案转为序列草案,显著提升量化模型性能。
English Summary: Speculative decoding and quantization are combined to accelerate large language model inference, but their integration reveals that 4-bit quantization's memory benefits are offset by computational overhead, leading to a new hierarchical framework that achieves significant speedup improvements.

Authors:Vihang Pancholi, Jainit Bafna, Tejas Anvekar, Manish Shrivastava, Vivek Gupta
Title: TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation
Abstract:
Evaluating tables qualitatively and quantitatively poses a significant challenge, as standard metrics often overlook subtle structural and content-level discrepancies. To address this, we propose a rubric-based evaluation framework that integrates multi-level structural descriptors with fine-grained contextual signals, enabling more precise and consistent table comparison. Building on this, we introduce TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval first aligns reference and predicted tables structurally via TabAlign, then performs semantic and syntactic comparison using TabCompare, offering interpretable and granular feedback. We evaluate TabXEval on TabXBench, a diverse, multi-domain benchmark featuring realistic table perturbations and human annotations. A sensitivity-specificity analysis further demonstrates the robustness and explainability of TabXEval across varied table tasks. Code and data are available at https://coral-lab-asu.github.io/tabxeval/

Authors:Zhuoyang Wu, Xinze Li, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Minghe Yu, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun
Title: EULER: Enhancing the Reasoning Ability of Large Language Models through Error-Induced Learning
Abstract:
Large Language Models (LLMs) have demonstrated strong reasoning capabilities and achieved promising results in mathematical problem-solving tasks. Learning from errors offers the potential to further enhance the performance of LLMs during Supervised Fine-Tuning (SFT). However, the errors in synthesized solutions are typically gathered from sampling trials, making it challenging to generate solution errors for each mathematical problem. This paper introduces the Error-IndUced LEaRning (EULER) model, which aims to develop an error exposure model that generates high-quality solution errors to enhance the mathematical reasoning capabilities of LLMs. Specifically, EULER optimizes the error exposure model to increase the generation probability of self-made solution errors while utilizing solutions produced by a superior LLM to regularize the generation quality. Our experiments across various mathematical problem datasets demonstrate the effectiveness of the EULER model, achieving an improvement of over 4% compared to all baseline models. Further analysis reveals that EULER is capable of synthesizing more challenging and educational solution errors, which facilitate both the training and inference processes of LLMs. All code is available at https://github.com/NEUIR/EULER.
中文: EULER模型通过生成高质量解题错误来增强大语言模型的数学推理能力,在监督微调中实现超过4%的性能提升。
English: The EULER model enhances LLMs' mathematical reasoning by generating high-quality solution errors during supervised fine-tuning, achieving over 4% improvement across datasets.

Authors:Runyu Wang, Peng Ping, Zhengyu Guo, Xiaoye Zhang, Quan Shi, Liting Zhou, Tianbo Ji
Title: LoKI: Low-damage Knowledge Implanting of Large Language Models
Abstract:
Fine-tuning adapts pretrained models for specific tasks but poses the risk of catastrophic forgetting (CF), where critical knowledge from pre-training is overwritten. Current Parameter-Efficient Fine-Tuning (PEFT) methods for Large Language Models (LLMs), while efficient, often sacrifice general capabilities. To address the issue of CF in a general-purpose PEFT framework, we propose \textbf{Lo}w-damage \textbf{K}nowledge \textbf{I}mplanting (\textbf{LoKI}), a PEFT technique that is based on a mechanistic understanding of how knowledge is stored in transformer architectures. In two real-world scenarios, LoKI demonstrates task-specific performance that is comparable to or even surpasses that of full fine-tuning and LoRA-based methods across various model types, while significantly better preserving general capabilities. Our work connects mechanistic insights into LLM knowledge storage with practical fine-tuning objectives, achieving state-of-the-art trade-offs between task specialization and the preservation of general capabilities. Our implementation is publicly available as ready-to-use code at https://github.com/Nexround/LoKI.
中文: LoKI是一种基于Transformer知识存储机制理解的参数高效微调方法,通过低损伤知识植入技术,在保证任务性能的同时显著减少灾难性遗忘,有效平衡了专业化与通用能力。
English: LoKI is a parameter-efficient fine-tuning method that mitigates catastrophic forgetting by leveraging mechanistic insights into transformer knowledge storage, achieving superior task performance while preserving general capabilities across various models.

Authors:Jintao Zhang, Zirui Liu, Mingyue Cheng, Shilong Zhang, Tingyue Pan, Yitong zhou, Qi Liu, Yanhu Xie
Title: Multimodal Forecasting of Sparse Intraoperative Hypotension Events Powered by Language Model
Abstract:
Intraoperative hypotension (IOH) frequently occurs under general anesthesia and is strongly linked to adverse outcomes such as myocardial injury and increased mortality. Despite its significance, IOH prediction is hindered by event sparsity and the challenge of integrating static and dynamic data across diverse patients. In this paper, we propose \textbf{IOHFuseLM}, a multimodal language model framework. To accurately identify and differentiate sparse hypotensive events, we leverage a two-stage training strategy. The first stage involves domain adaptive pretraining on IOH physiological time series augmented through diffusion methods, thereby enhancing the model sensitivity to patterns associated with hypotension. Subsequently, task fine-tuning is performed on the original clinical dataset to further enhance the ability to distinguish normotensive from hypotensive states. To enable multimodal fusion for each patient, we align structured clinical descriptions with the corresponding physiological time series at the token level. Such alignment enables the model to capture individualized temporal patterns alongside their corresponding clinical semantics. In addition, we convert static patient attributes into structured text to enrich personalized information. Experimental evaluations on two intraoperative datasets demonstrate that IOHFuseLM outperforms established baselines in accurately identifying IOH events, highlighting its applicability in clinical decision support scenarios. Our code is publicly available to promote reproducibility at https://github.com/zjt-gpu/IOHFuseLM.
中文:IOHFuseLM框架采用多模态语言模型和两阶段训练策略,通过整合静态与动态患者数据精确预测术中低血压,在临床评估中展现出卓越性能。
English: The IOHFuseLM framework uses a multimodal language model with a two-stage training strategy to accurately predict intraoperative hypotension by integrating static and dynamic patient data, demonstrating superior performance in clinical evaluations.

Authors:Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan
Title: Curse of High Dimensionality Issue in Transformer for Long-context Modeling
Abstract:
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to \textit{redundant} attention computations: while attention weights are often \textit{sparse}, all tokens consume \textit{equal} computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a \textit{supervised learning task}, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a \textit{group coding strategy}, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose \textit{Dynamic Group Attention} (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance. Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
中文: Transformer架构的大型语言模型因冗余注意力计算存在效率问题,本文提出的动态分组注意力(DGA)方法通过聚合次要标记显著降低计算成本,同时保持模型性能竞争力。
English: Transformer-based LLMs face computational inefficiency from redundant attention computations, which the proposed Dynamic Group Attention (DGA) method addresses by aggregating less important tokens to reduce costs while maintaining performance.
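
The redundancy-reduction idea admits a compact sketch: leave important tokens untouched and aggregate the rest into group representatives, so attention runs over a shorter sequence. The importance scoring and grouping rule here are simplified stand-ins for DGA's group coding.

```python
import torch

def group_aggregate(x: torch.Tensor, importance: torch.Tensor,
                    keep: int, n_groups: int):
    """x: (N, D) token states; importance: (N,) per-token scores."""
    order = importance.argsort(descending=True)
    kept = x[order[:keep]]                   # important tokens pass through intact
    rest = x[order[keep:]]
    groups = rest.chunk(n_groups)            # coarse grouping of the remainder
    reps = torch.stack([g.mean(dim=0) for g in groups if len(g)])
    return torch.cat([kept, reps])           # shorter sequence for attention

out = group_aggregate(torch.randn(1024, 256), torch.rand(1024),
                      keep=256, n_groups=32)
print(out.shape)   # torch.Size([288, 256])
```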

Authors:Ran Li, Shimin Di, Yuchen Liu, Chen Jing, Yu Qiu, Lei Chen
Title: Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO
Abstract:
Previous studies suggest that powerful Large Language Models (LLMs) trained with Reinforcement Learning with Verifiable Rewards (RLVR) only refine the reasoning path without improving reasoning capacity on math tasks, while supervised fine-tuning (SFT) with distillation can. We study this from the perspective of scientific information extraction (SciIE), where both LLMs and reasoning LLMs underperform small BERT-based models. SciIE requires both reasoning and memorization. We argue that both SFT and RLVR can refine the reasoning path and improve reasoning capacity in a simple way based on SciIE. We propose two-stage training with: 1. MimicSFT, which uses structured reasoning templates without requiring high-quality chain-of-thought data; 2. R$^2$GRPO with relevance and rule-induced rewards. Experiments on scientific IE benchmarks show that both methods can improve reasoning capacity. R$^2$GRPO with MimicSFT surpasses baseline LLMs and specialized supervised models in relation extraction. Our code is available at https://github.com/ranlislz/R2GRPO.
中文摘要:研究表明,结合监督微调(MimicSFT)和强化学习(R²GRPO)能提升科学信息抽取中的推理能力,超越基线模型表现。
English Summary: The study demonstrates that combining supervised fine-tuning (MimicSFT) and reinforcement learning (R²GRPO) enhances reasoning capacity in scientific information extraction, outperforming baseline models.

Authors:Qiuchen Wang, Ruixue Ding, Yu Zeng, Zehui Chen, Lin Chen, Shihang Wang, Pengjun Xie, Fei Huang, Feng Zhao
Title: VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
Abstract:
Effectively retrieving, reasoning and understanding visually rich information remains a challenge for RAG methods. Traditional text-based methods cannot handle visual-related information. On the other hand, current vision-based RAG approaches are often limited by fixed pipelines and frequently struggle to reason effectively due to the insufficient activation of the fundamental capabilities of models. As RL has been proven to be beneficial for model reasoning, we introduce VRAG-RL, a novel RL framework tailored for complex reasoning across visually rich information. With this framework, VLMs interact with search engines, autonomously sampling single-turn or multi-turn reasoning trajectories with the help of visual perception tokens and undergoing continual optimization based on these samples. Our approach highlights key limitations of RL in RAG domains: (i) Prior Multi-modal RAG approaches tend to merely incorporate images into the context, leading to insufficient reasoning token allocation and neglecting visual-specific perception; and (ii) When models interact with search engines, their queries often fail to retrieve relevant information due to the inability to articulate requirements, thereby leading to suboptimal performance. To address these challenges, we define an action space tailored for visually rich inputs, with actions including cropping and scaling, allowing the model to gather information from a coarse-to-fine perspective. Furthermore, to bridge the gap between users' original inquiries and the retriever, we employ a simple yet effective reward that integrates query rewriting and retrieval performance with a model-based reward. Our VRAG-RL optimizes VLMs for RAG tasks using specially designed RL strategies, aligning the model with real-world applications. The code is available at https://github.com/Alibaba-NLP/VRAG.
中文摘要:VRAG-RL框架通过让视觉语言模型自主采样并优化推理轨迹,结合交互式搜索引擎查询和视觉感知标记,解决了当前视觉RAG方法在复杂视觉信息推理中的固有限制。
English Summary: The VRAG-RL framework addresses limitations in visual reasoning for RAG systems by enabling VLMs to autonomously sample and optimize reasoning trajectories through interactive search engine queries and visual perception tokens.

Authors:Ruicheng Yin, Xuan Gao, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang
Title: Improving Continual Pre-training Through Seamless Data Packing
Abstract:
Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input texts and splitting them into fixed-length sequences. While straightforward and efficient, this method often leads to excessive truncation and context discontinuity, which can hinder model performance. To address these issues, we explore the potential of data engineering to enhance continual pre-training, particularly its impact on model performance and efficiency. We propose Seamless Packing (SP), a novel data packing strategy aimed at preserving contextual information more effectively and enhancing model performance. Our approach employs a sliding window technique in the first stage that synchronizes overlapping tokens across consecutive sequences, ensuring better continuity and contextual coherence. In the second stage, we adopt a First-Fit-Decreasing algorithm to pack shorter texts into bins slightly larger than the target sequence length, thereby minimizing padding and truncation. Empirical evaluations across various model architectures and corpus domains demonstrate the effectiveness of our method, outperforming the baseline method in 99% of all settings. Code is available at https://github.com/Infernus-WIND/Seamless-Packing.
中文: Seamless Packing (SP) 是一种新颖的数据打包策略,通过滑动窗口和首次适应递减算法在持续预训练中有效保持上下文连贯性,在99%的评估场景中显著超越基线方法。
English: Seamless Packing (SP) is a novel data packing strategy that uses a sliding window and First-Fit-Decreasing algorithm to preserve contextual integrity during continual pre-training, significantly outperforming baseline methods in 99% of evaluations.
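
The second stage is classic First-Fit-Decreasing bin packing into bins slightly larger than the target sequence length; the 10% slack below is an assumed margin, not the paper's tuned value.

```python
def ffd_pack(doc_lengths: list[int], seq_len: int, slack: float = 0.1):
    """First-Fit-Decreasing: returns lists of document indices per bin."""
    capacity = int(seq_len * (1 + slack))    # bins slightly larger than target
    bins, loads = [], []
    for idx in sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i]):
        for b, load in enumerate(loads):
            if load + doc_lengths[idx] <= capacity:   # first bin that fits
                bins[b].append(idx)
                loads[b] += doc_lengths[idx]
                break
        else:
            bins.append([idx])
            loads.append(doc_lengths[idx])
    return bins

print(ffd_pack([900, 700, 600, 400, 300, 100], seq_len=1024))
# [[0, 5], [1, 3], [2, 4]]
```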

Authors:Fakhraddin Alwajih, Samar Mohamed Magdy, Abdellah El Mekki, Omer Nacar, Youssef Nafea, Safaa Taher Abdelfadil, Abdulfattah Mohammed Yahya, Hamzah Luqman, Nada Almarwani, Samah Aloufi, Baraah Qawasmeh, Houdaifa Atou, Serry Sibaee, Hamzah A. Alsayadi, Walid Al-Dhabyani, Maged S. Al-shaibani, Aya El aatar, Nour Qandos, Rahaf Alhamouri, Samar Ahmad, Razan Khassib, Lina Hamad, Mohammed Anwar AL-Ghrawi, Fatimah Alshamari, Cheikh Malainine, Doaa Qawasmeh, Aminetou Yacoub, Tfeil moilid, Ruwa AbuHweidi, Ahmed Aboeitta, Vatimetou Mohamed Lemin, Reem Abdel-Salam, Ahlam Bashiti, Adel Ammar, Aisha Alansari, Ahmed Ashraf, Nora Alturayeif, Sara Shatnawi, Alcides Alcoba Inciarte, AbdelRahim A. Elmadany, Mohamedou cheikh tourad, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Title: Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset
Abstract:
Mainstream large vision-language models (LVLMs) inherently encode cultural biases, highlighting the need for diverse multimodal datasets. To address this gap, we introduce Pearl, a large-scale Arabic multimodal dataset and benchmark explicitly designed for cultural understanding. Constructed through advanced agentic workflows and extensive human-in-the-loop annotations by 45 annotators from across the Arab world, Pearl comprises over K multimodal examples spanning ten culturally significant domains covering all Arab countries. We further provide two robust evaluation benchmarks Pearl and Pearl-Lite along with a specialized subset Pearl-X explicitly developed to assess nuanced cultural variations. Comprehensive evaluations on state-of-the-art open and proprietary LVLMs demonstrate that reasoning-centric instruction alignment substantially improves models' cultural grounding compared to conventional scaling methods. Pearl establishes a foundational resource for advancing culturally-informed multimodal modeling research. All datasets and benchmarks are publicly available.
中文: 为解决主流视觉语言模型中的文化偏见问题,PEARL数据集作为包含30.9万条阿拉伯文化多模态样本的基准资源应运而生,其评估表明以推理为核心的指令对齐方法相比传统扩展方式能更有效提升模型的文化认知能力。
English: To address cultural biases in large vision-language models, the PEARL dataset was developed as a comprehensive Arabic multimodal resource with over 309K culturally-grounded examples, demonstrating that instruction-focused alignment significantly enhances cultural understanding compared to standard scaling approaches.

Authors:Aditya Gunturu, Ben Pearman, Keiichi Ihara, Morteza Faraji, Bryan Wang, Rubaiat Habib Kazi, Ryo Suzuki
Title: MapStory: Prototyping Editable Map Animations with LLM Agents
Abstract:
We introduce MapStory, an LLM-powered animation prototyping tool that generates editable map animation sequences directly from natural language text by leveraging a dual-agent LLM architecture. Given a user written script, MapStory automatically produces a scene breakdown, which decomposes the text into key map animation primitives such as camera movements, visual highlights, and animated elements. Our system includes a researcher agent that accurately queries geospatial information by leveraging an LLM with web search, enabling automatic extraction of relevant regions, paths, and coordinates while allowing users to edit and query for changes or additional information to refine the results. Additionally, users can fine-tune parameters of these primitive blocks through an interactive timeline editor. We detail the system's design and architecture, informed by formative interviews with professional animators and by an analysis of 200 existing map animation videos. Our evaluation, which includes expert interviews (N=5) and a usability study (N=12), demonstrates that MapStory enables users to create map animations with ease, facilitates faster iteration, encourages creative exploration, and lowers barriers to creating map-centric stories.

Authors:Tianyu Guo, Hande Dong, Yichong Leng, Feng Liu, Cheater Lin, Nong Xiao, Xianwei Zhang
Title: EFIM: Efficient Serving of LLMs for Infilling Tasks with Improved KV Cache Reuse
Abstract:
Large language models (LLMs) are often used for infilling tasks, which involve predicting or generating missing information in a given text. These tasks typically require multiple interactions with similar context. To reduce the computation of repeated historical tokens, cross-request key-value (KV) cache reuse, a technique that stores and reuses intermediate computations, has become a crucial method in multi-round interactive services. However, in infilling tasks, the KV cache reuse is often hindered by the structure of the prompt format, which typically consists of a prefix and suffix relative to the insertion point. Specifically, the KV cache of the prefix or suffix part is frequently invalidated as the other part (suffix or prefix) is incrementally generated. To address the issue, we propose EFIM, a transformed prompt format of FIM to unleash the performance potential of KV cache reuse. Although the transformed prompt can solve the inefficiency, it exposes subtoken generation problems in current LLMs, where they have difficulty generating partial words accurately. Therefore, we introduce a fragment tokenization training method which splits text into multiple fragments before tokenization during data processing. Experiments on two representative LLMs show that LLM serving with EFIM can lower the latency by 52% and improve the throughput by 98% while maintaining the original infilling capability. EFIM's source code is publicly available at https://github.com/gty111/EFIM.
中文摘要:EFIM通过改进提示格式和引入分段标记化训练,有效提升大语言模型中KV缓存的复用效率,在保持原有填充能力的同时,将延迟降低52%、吞吐量提高98%。
English Summary: EFIM introduces a transformed prompt format and fragment tokenization training to enhance KV cache reuse in large language models, significantly reducing latency by 52% and boosting throughput by 98% while preserving infilling performance.
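
To make the caching problem concrete, here is a vanilla fill-in-the-middle prompt; the sentinel tokens follow common FIM conventions and are not necessarily EFIM's exact transformed format.

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"   # the model generates the middle

# After each accepted completion the prefix grows, so every suffix token shifts
# position and its KV cache entries are invalidated. EFIM's transformed format
# keeps the shared context position-stable so the cache can be reused across
# rounds; the exact transformation is detailed in the paper and repo.
print(fim_prompt("def add(a, b):\n    ", "\n    return result"))
```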

Authors:Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Zsolt Kira
Title: FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Abstract:
Visual question answering (VQA) systems face significant challenges when adapting to real-world data shifts, especially in multi-modal contexts. While robust fine-tuning strategies are essential for maintaining performance across in-distribution (ID) and out-of-distribution (OOD) scenarios, current evaluation settings are primarily unimodal or particular to some types of OOD, offering limited insight into the complexities of multi-modal contexts. In this work, we propose a new benchmark FRAMES-VQA (Fine-Tuning Robustness across Multi-Modal Shifts in VQA) for evaluating robust fine-tuning for VQA tasks. We utilize ten existing VQA benchmarks, including VQAv2, IV-VQA, VQA-CP, OK-VQA and others, and categorize them into ID, near and far OOD datasets covering uni-modal, multi-modal and adversarial distribution shifts. We first conduct a comprehensive comparison of existing robust fine-tuning methods. We then quantify the distribution shifts by calculating the Mahalanobis distance using uni-modal and multi-modal embeddings extracted from various models. Further, we perform an extensive analysis to explore the interactions between uni- and multi-modal shifts as well as modality importance for ID and OOD samples. These analyses offer valuable guidance on developing more robust fine-tuning methods to handle multi-modal distribution shifts. The code is available at https://github.com/chengyuehuang511/FRAMES-VQA .
中文: 本文提出了FRAMES-VQA新基准,用于评估视觉问答任务中针对多模态分布变化的鲁棒微调方法,通过将数据集分类为分布内和分布外类型并分析模态交互作用,为开发更鲁棒的微调方法提供指导。
English: This paper introduces FRAMES-VQA, a new benchmark for evaluating robust fine-tuning in visual question answering across multi-modal distribution shifts, categorizing datasets into in-distribution and out-of-distribution types while analyzing modality interactions to guide future method development.
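The shift-quantification step is standard enough to sketch: fit a mean and covariance to in-distribution embeddings and score each out-of-distribution embedding by its Mahalanobis distance. A minimal sketch assuming per-sample embedding matrices; the covariance regularizer `eps` is our addition, not from the paper:

```python
import numpy as np

def mahalanobis_distances(id_emb, ood_emb, eps=1e-6):
    """Distance of each OOD embedding from the ID distribution:
    sqrt((x - mu)^T Sigma^{-1} (x - mu)) with mu, Sigma fit on ID data."""
    mu = id_emb.mean(axis=0)
    cov = np.cov(id_emb, rowvar=False)
    cov_inv = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))
    diffs = ood_emb - mu
    d2 = np.einsum("ij,jk,ik->i", diffs, cov_inv, diffs)
    return np.sqrt(d2)
```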

Authors:Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Chuchu Fan
Title: R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
Abstract:
Despite advances in reasoning and planning of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, in which textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), while exhibiting emergent self-checking behavior via code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
中文摘要:本文提出R1-Code-Interpreter,通过多轮监督微调与强化学习训练大语言模型在逐步推理中自主生成代码查询,其R1-CI-14B模型将37项测试任务的平均准确率从44.0%提升至64.1%,超越纯文本GPT-4o并接近带代码解释器的GPT-4o。
English Summary: This paper introduces R1-Code-Interpreter, which trains LLMs via multi-turn supervised fine-tuning and reinforcement learning to autonomously generate code queries during step-by-step reasoning; the resulting R1-CI-14B model raises average accuracy on 37 test tasks from 44.0% to 64.1%, outperforming text-only GPT-4o and approaching GPT-4o with Code Interpreter.
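The interaction pattern the paper trains for (generate, execute a code query, feed the output back, repeat) can be sketched independently of the training details. A toy loop, assuming `generate` is any text-completion callable and that code queries arrive as fenced Python blocks; none of this is the released implementation:

```python
import re
import subprocess
import tempfile

FENCE = "`" * 3  # literal triple backtick, built this way to keep this listing intact
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def reason_with_code(generate, prompt, max_turns=5):
    """Multi-turn loop: each model turn may emit a code query; the code is
    executed and its output appended to the context for the next turn."""
    context = prompt
    for _ in range(max_turns):
        completion = generate(context)
        context += completion
        match = CODE_RE.search(completion)
        if match is None:  # no code query means a final textual answer
            return context
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(match.group(1))
        result = subprocess.run(["python", f.name], capture_output=True,
                                text=True, timeout=30)
        context += f"\nExecution output:\n{result.stdout or result.stderr}\n"
    return context
```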

Authors:Miao Peng, Nuo Chen, Jianheng Tang, Jia Li
Title: How does Misinformation Affect Large Language Model Behaviors and Preferences?
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in knowledge-intensive tasks, while they remain vulnerable when encountering misinformation. Existing studies have explored the role of LLMs in combating misinformation, but there is still a lack of fine-grained analysis on the specific aspects and extent to which LLMs are influenced by misinformation. To bridge this gap, we present MisBench, the current largest and most comprehensive benchmark for evaluating LLMs' behavior and knowledge preference toward misinformation. MisBench consists of 10,346,712 pieces of misinformation, which uniquely considers both knowledge-based conflicts and stylistic variations in misinformation. Empirical results reveal that while LLMs demonstrate comparable abilities in discerning misinformation, they still remain susceptible to knowledge conflicts and stylistic variations. Based on these findings, we further propose a novel approach called Reconstruct to Discriminate (RtD) to strengthen LLMs' ability to detect misinformation. Our study provides valuable insights into LLMs' interactions with misinformation, and we believe MisBench can serve as an effective benchmark for evaluating LLM-based detectors and enhancing their reliability in real-world applications. Codes and data are available at https://github.com/GKNL/MisBench.
Chinese: 大型语言模型在知识冲突和风格变化方面对错误信息表现出显著脆弱性,为此我们构建了MisBench这一全面基准,通过提出的重构判别方法(RtD)来评估并增强其检测能力。
English: Large Language Models (LLMs) demonstrate significant vulnerability to misinformation, particularly in knowledge conflicts and stylistic variations, leading to the creation of MisBench—a comprehensive benchmark to evaluate and enhance their detection capabilities through the proposed Reconstruct to Discriminate (RtD) method.

Authors:Tianyu Fu, Yi Ge, Yichen You, Enshu Liu, Zhihang Yuan, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Title: R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing
Abstract:
Large Language Models (LLMs) achieve impressive reasoning capabilities at the cost of substantial inference overhead, posing significant deployment challenges. Although distilled Small Language Models (SLMs) significantly enhance efficiency, their performance suffers as they fail to follow LLMs' reasoning paths. Luckily, we reveal that only a small fraction of tokens genuinely diverge reasoning paths between LLMs and SLMs. Most generated tokens are either identical or exhibit neutral differences, such as minor variations in abbreviations or expressions. Leveraging this insight, we introduce Roads to Rome (R2R), a neural token routing method that selectively utilizes LLMs only for these critical, path-divergent tokens, while leaving the majority of token generation to the SLM. We also develop an automatic data generation pipeline that identifies divergent tokens and generates token-level routing labels to train the lightweight router. We apply R2R to combine R1-1.5B and R1-32B models from the DeepSeek family, and evaluate on challenging math, coding, and QA benchmarks. With an average activated parameter size of 5.6B, R2R surpasses the average accuracy of R1-7B by 1.6x, outperforming even the R1-14B model. Compared to R1-32B, it delivers a 2.8x wall-clock speedup with comparable performance, advancing the Pareto frontier of test-time scaling efficiency. Our code is available at https://github.com/thu-nics/R2R.
中文: Roads to Rome (R2R)方法通过神经令牌路由技术,仅将关键推理令牌交由大模型处理,其余生成任务由小模型承担,在保持相当性能的同时实现2.8倍加速,显著提升了推理效率边界。
English: The Roads to Rome (R2R) method selectively routes only critical reasoning tokens to large language models while delegating most generation to small models, achieving comparable performance with 2.8x speedup and advancing efficiency frontiers.
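The routing loop itself is simple to sketch: the small model proposes each token, and a learned router inspects its hidden state to decide whether the large model should regenerate that token. A greedy-decoding sketch with Hugging Face-style model outputs; the `router` interface and the 0.5 threshold are our assumptions:

```python
import torch

@torch.no_grad()
def r2r_decode(slm, llm, router, input_ids, max_new_tokens=128):
    """Token-level routing: default to the SLM's token, escalate to the
    LLM only when the router flags the position as path-divergent."""
    ids = input_ids  # shape (1, seq_len)
    for _ in range(max_new_tokens):
        out = slm(ids, output_hidden_states=True)
        hidden = out.hidden_states[-1][:, -1]          # last-position state
        token = out.logits[:, -1].argmax(-1, keepdim=True)
        if router(hidden).item() > 0.5:                # divergent token
            token = llm(ids).logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, token], dim=-1)
    return ids
```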

Authors:Dingming Li, Hongxing Li, Zixuan Wang, Yuchen Yan, Hang Zhang, Siqi Chen, Guiyang Hou, Shengpei Jiang, Wenqi Zhang, Yongliang Shen, Weiming Lu, Yueting Zhuang
Title: ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models
Abstract:
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content, but significant challenges persist in tasks requiring cross-viewpoint understanding and spatial reasoning. We identify a critical limitation: current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints when required to adopt another entity's spatial frame of reference. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation across five distinct task types, supported by an automated 3D annotation pipeline that generates precise directional labels. Comprehensive evaluation of diverse VLMs on ViewSpatial-Bench reveals a significant performance disparity: models demonstrate reasonable performance on camera-perspective tasks but exhibit reduced accuracy when reasoning from a human viewpoint. By fine-tuning VLMs on our multi-perspective spatial dataset, we achieve an overall performance improvement of 46.24% across tasks, highlighting the efficacy of our approach. Our work establishes a crucial benchmark for spatial intelligence in embodied AI systems and provides empirical evidence that modeling 3D spatial relationships enhances VLMs' corresponding spatial comprehension capabilities.
中文摘要:视觉语言模型在他中心(allocentric)空间推理方面存在不足,但通过多视角数据集微调后性能显著提升,ViewSpatial-Bench基准测试验证了这一突破。
English Summary: Vision-language models struggle with allocentric spatial reasoning but show significant improvement when fine-tuned on multi-perspective datasets, as demonstrated by the new ViewSpatial-Bench benchmark.

Authors:Wei Pang, Kevin Qinghong Lin, Xiangru Jian, Xi He, Philip Torr
Title: Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Abstract:
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters; (ii) Textual Coherence: language fluency; (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge; and notably (iv) PaperQuiz: the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text-visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter-Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs, though visually appealing at first glance, often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g. based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. It transforms a 22-page paper into a finalized yet editable .pptx poster, all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models. The code and datasets are available at https://github.com/Paper2Poster/Paper2Poster.
中文总结:本研究提出了首个学术海报生成基准和PosterAgent流程,在视觉文本连贯性和成本效益上超越现有方法,同时揭示了自动化设计中的关键瓶颈。
English Summary: This research introduces the first benchmark and PosterAgent pipeline for academic poster generation, which outperforms existing methods in visual-textual coherence and cost-efficiency while identifying key bottlenecks in automated design.

Authors:Han Xiao, Guozhi Wang, Yuxiang Chai, Zimu Lu, Weifeng Lin, Hao He, Lue Fan, Liuyang Bian, Rui Hu, Liang Liu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Aojun Zhou, Hongsheng Li
Title: UI-Genie: A Self-Improving Approach for Iteratively Boosting MLLM-based Mobile GUI Agents
Abstract:
In this paper, we introduce UI-Genie, a self-improving framework addressing two key challenges in GUI agents: verification of trajectory outcome is challenging and high-quality training data are not scalable. These challenges are addressed by a reward model and a self-improving pipeline, respectively. The reward model, UI-Genie-RM, features an image-text interleaved architecture that efficiently processes historical context and unifies action-level and task-level rewards. To support the training of UI-Genie-RM, we develop deliberately-designed data generation strategies including rule-based verification, controlled trajectory corruption, and hard negative mining. To address the second challenge, a self-improvement pipeline progressively expands solvable complex GUI tasks by enhancing both the agent and reward models through reward-guided exploration and outcome verification in dynamic environments. For training the model, we generate UI-Genie-RM-517k and UI-Genie-Agent-16k, establishing the first reward-specific dataset for GUI agents while demonstrating high-quality synthetic trajectory generation without manual annotation. Experimental results show that UI-Genie achieves state-of-the-art performance across multiple GUI agent benchmarks with three generations of data-model self-improvement. We open-source our complete framework implementation and generated datasets to facilitate further research at https://github.com/Euphoria16/UI-Genie.
中文: 本文提出UI-Genie自改进框架,通过专用奖励模型和自动化数据生成解决GUI智能体的两大难题,在无需人工标注的情况下实现了最优性能。
English: This paper presents UI-Genie, a self-improving framework that tackles GUI agent challenges through a specialized reward model and automated data generation, achieving state-of-the-art results without manual annotation.

Authors:Xiangxin Zhou, Zichen Liu, Anya Sims, Haonan Wang, Tianyu Pang, Chongxuan Li, Liang Wang, Min Lin, Chao Du
Title: Reinforcing General Reasoning without Verifiers
Abstract:
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address this and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
Chinese: 提出的VeriFree方法通过直接最大化生成参考答案的概率,绕过了强化学习中的答案验证环节,在保持与验证器方法相当甚至更优性能的同时,显著提升了实用性和计算效率。
English: The proposed VeriFree method eliminates the need for answer verification in reinforcement learning by directly maximizing the probability of generating reference answers, demonstrating comparable or superior performance to verifier-based approaches while offering significant practical and computational advantages.
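The core quantity VeriFree optimizes, the likelihood of the reference answer conditioned on the question plus the model's own sampled reasoning trace, reduces to a teacher-forced cross-entropy over the answer span. A sketch of that term alone (the paper's full RL estimator and variational view are not reproduced here):

```python
import torch
import torch.nn.functional as F

def reference_answer_nll(model, prompt_ids, reasoning_ids, answer_ids):
    """Negative log-likelihood of the reference answer given the prompt and
    a sampled reasoning trace; minimizing it maximizes p(answer | q, trace)."""
    input_ids = torch.cat([prompt_ids, reasoning_ids, answer_ids], dim=-1)
    logits = model(input_ids).logits                   # (B, L, V)
    start = prompt_ids.size(-1) + reasoning_ids.size(-1)
    answer_logits = logits[:, start - 1:-1]            # positions predicting answer tokens
    return F.cross_entropy(answer_logits.transpose(1, 2), answer_ids)
```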

Authors:Keenan Samway, Max Kleiman-Weiner, David Guzman Piedrahita, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Title: Are Language Models Consequentialist or Deontological Moral Reasoners?
Abstract:
As AI systems increasingly navigate applications in healthcare, law, and governance, understanding how they handle ethically complex scenarios becomes critical. Previous work has mainly examined the moral judgments in large language models (LLMs), rather than their underlying moral reasoning process. In contrast, we focus on a large-scale analysis of the moral reasoning traces provided by LLMs. Furthermore, unlike prior work that attempted to draw inferences from only a handful of moral dilemmas, our study leverages over 600 distinct trolley problems as probes for revealing the reasoning patterns that emerge within different LLMs. We introduce and test a taxonomy of moral rationales to systematically classify reasoning traces according to two main normative ethical theories: consequentialism and deontology. Our analysis reveals that LLM chains-of-thought tend to favor deontological principles based on moral obligations, while post-hoc explanations shift notably toward consequentialist rationales that emphasize utility. Our framework provides a foundation for understanding how LLMs process and articulate ethical considerations, an important step toward safe and interpretable deployment of LLMs in high-stakes decision-making environments. Our code is available at https://github.com/keenansamway/moral-lens .
中文摘要:本研究通过600多个电车难题对大型语言模型进行大规模道德推理分析,发现其思维链推理倾向于义务论原则,而事后解释则明显转向强调效用的功利主义理据。
English Summary: This study conducts a large-scale analysis of moral reasoning in large language models using over 600 trolley problems, revealing that while their chain-of-thought reasoning favors deontological principles, their post-hoc explanations shift toward consequentialist rationales.

Authors:Zijun Liu, Zhennan Wan, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Title: Scaling External Knowledge Input Beyond Context Windows of LLMs via Multi-Agent Collaboration
Abstract:
With the rapid advancement of post-training techniques for reasoning and information seeking, large language models (LLMs) can incorporate a large quantity of retrieved knowledge to solve complex tasks. However, the limited context window of LLMs obstructs scaling the amount of external knowledge input, prohibiting further improvement, especially for tasks requiring a significant amount of external knowledge. Existing context window extension methods inevitably cause information loss. LLM-based multi-agent methods emerge as a new paradigm to handle massive input in a distributed manner, where we identify two core bottlenecks in existing knowledge synchronization and reasoning processes. In this work, we develop a multi-agent framework, ExtAgents, to overcome the bottlenecks and enable better scalability in inference-time knowledge integration without longer-context training. Benchmarked with our enhanced multi-hop question answering test, ∞Bench+, and other public test sets including long survey generation, ExtAgents significantly enhances the performance over existing non-training methods with the same amount of external knowledge input, regardless of whether it falls within or exceeds the context window. Moreover, the method maintains high efficiency due to high parallelism. Further study in the coordination of LLM agents on increasing external knowledge input could benefit real-world applications.
中文:ExtAgents多智能体框架通过分布式知识集成突破了上下文窗口限制,无需长文本训练即可实现高效扩展,在处理海量外部知识的任务中显著优于现有方法,同时保持高度并行化的运行效率。
English: The ExtAgents multi-agent framework overcomes context window limitations by enabling scalable, parallel knowledge integration without extended training, significantly outperforming existing methods on tasks requiring extensive external data while maintaining high efficiency.

Authors:Bozhou Li, Wentao Zhang
Title: ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models
Abstract:
Currently, a prevalent approach for enhancing Vision-Language Models (VLMs) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. While effective, this method generates a large number of image tokens. When combined with the widely used Rotary Position Embedding (RoPE), its long-term decay property hinders the interaction between high-resolution tokens and thumbnail tokens, as well as between text and image. To address these issues, we propose ID-Align, which alleviates these problems by reordering position IDs. In this method, high-resolution tokens inherit IDs from their corresponding thumbnail token while constraining the overexpansion of positional indices. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements, including a 6.09% enhancement on MMBench's relation reasoning tasks and notable gains across multiple benchmarks. Our code is available at the following link: https://github.com/zooblastlbz/ID-Align.
Chinese: 针对视觉语言模型中同时编码高分辨率和缩略图图像导致的效率低下及交互限制问题,ID-Align方法通过重新排列位置标识来增强标记间交互,在多个基准测试中实现了显著性能提升。
English: To address the inefficiency and interaction limitations caused by encoding both high-resolution and thumbnail images in Vision-Language Models, the proposed ID-Align method reorders position IDs to enhance token interaction and achieves significant performance improvements across multiple benchmarks.
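The remapping can be sketched concretely: thumbnail patches get consecutive position IDs after the text, and each high-resolution patch inherits the ID of the thumbnail patch it spatially overlaps, so positional indices never over-expand. The grid layout and token ordering below are illustrative assumptions, not LLaVA-Next's exact bookkeeping:

```python
def id_align_position_ids(num_text, thumb_hw, scale):
    """Position IDs for [text tokens | thumbnail tokens | high-res tokens],
    with every high-res patch reusing its covering thumbnail patch's ID."""
    th, tw = thumb_hw
    pos = list(range(num_text))                    # text: 0 .. num_text-1
    base = num_text
    pos += [base + r * tw + c for r in range(th) for c in range(tw)]
    for r in range(th * scale):                    # high-res grid
        for c in range(tw * scale):
            pos.append(base + (r // scale) * tw + (c // scale))
    return pos

# e.g. 4 text tokens, a 2x2 thumbnail, and a 2x-upscaled high-res grid
print(id_align_position_ids(4, (2, 2), 2))
```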

Authors:Xiao Liu, Da Yin, Zirui Wu, Yansong Feng
Title: RefTool: Enhancing Model Reasoning with Reference-Guided Tool Creation
Abstract:
Tools enhance the reasoning capabilities of large language models (LLMs) in complex problem-solving tasks, but not all tasks have available tools. In the absence of predefined tools, prior works have explored instructing LLMs to generate tools on their own. However, such approaches rely heavily on the models' internal knowledge and would fail in domains beyond the LLMs' knowledge scope. To address this limitation, we propose RefTool, a reference-guided framework for automatic tool creation that leverages structured external materials such as textbooks. RefTool consists of two modules: (1) tool creation, where LLMs generate executable tools from reference content, validate them using illustrative examples, and organize them hierarchically into a toolbox; and (2) tool utilization, where LLMs navigate the toolbox structure to select and apply the appropriate tools to solve problems. Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 11.3% on average accuracy, while being cost-efficient and broadly generalizable. Analyses reveal that grounding tool creation in references produces accurate and faithful tools, and that the hierarchical structure facilitates effective tool selection. RefTool enables LLMs to overcome knowledge limitations, demonstrating the value of grounding tool creation in external references for enhanced and generalizable reasoning.
中文摘要:RefTool框架通过利用教科书等结构化外部参考资料,指导大语言模型生成可执行工具并分层管理,有效突破了模型的知识局限,在因果推理、物理和化学任务中平均准确率提升11.3%。
English Summary: RefTool is a framework that enables large language models to create executable tools using structured external references like textbooks, overcoming knowledge limitations and improving reasoning accuracy by 11.3% through hierarchical organization and validation.
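The create-then-validate loop at the heart of the tool-creation module can be sketched in a few lines: ask an LLM to write a function from reference content, then keep the tool only if it reproduces the reference's worked examples. The prompt wording, the `llm` callable, and the `tool(x)` signature are all hypothetical:

```python
def create_and_validate_tool(llm, reference_text, examples):
    """Generate an executable tool from reference material and accept it
    only if it passes the illustrative examples (input/output pairs)."""
    code = llm("Write a Python function `tool(x)` implementing the method "
               "described below.\n\n" + reference_text)
    namespace = {}
    try:
        exec(code, namespace)                       # materialize the tool
        tool = namespace["tool"]
        ok = all(tool(x) == y for x, y in examples)
    except Exception:
        return None, code                           # generation failed
    return (tool if ok else None), code
```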

Authors:Xiusi Chen, Shanyong Wang, Cheng Qian, Hongru Wang, Peixuan Han, Heng Ji
Title: DecisionFlow: Advancing Large Language Model as Principled Decision Maker
Abstract:
In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model's reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. Code and data are at https://github.com/xiusic/DecisionFlow.
Chinese: DecisionFlow提出了一种结构化推理框架,通过构建语义决策空间并评估权衡来指导语言模型进行透明、效用驱动的决策,在关键领域实现了显著准确性提升与结果一致性增强。
English: DecisionFlow introduces a structured reasoning framework that enables language models to make transparent, utility-driven decisions by evaluating trade-offs within a semantically grounded decision space, achieving significant accuracy improvements and enhanced outcome alignment in high-stakes domains.

Authors:Yifei Liu, Li Lyna Zhang, Yi Zhu, Bingcheng Dong, Xudong Zhou, Ning Shang, Fan Yang, Mao Yang
Title: rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset
Abstract:
Advancing code reasoning in large language models (LLMs) is fundamentally limited by the scarcity of high-difficulty datasets, especially those with verifiable input-output test cases necessary for rigorous solution validation at scale. We introduce rStar-Coder, which significantly improves LLM code reasoning capabilities by constructing a large-scale, verified dataset of 418K competition-level code problems, 580K long-reasoning solutions along with rich test cases of varying difficulty. This is achieved through three core contributions: (1) we curate competitive programming code problems and oracle solutions to synthesize new, solvable problems; (2) we introduce a reliable input-output test case synthesis pipeline that decouples the generation into a three-step input generation method and a mutual verification mechanism for effective output labeling; (3) we augment problems with high-quality, test-case-verified long-reasoning solutions. Extensive experiments on Qwen models (1.5B-14B) across various code reasoning benchmarks demonstrate the superiority of rStar-Coder dataset, achieving leading performance comparable to frontier reasoning LLMs with much smaller model sizes. On LiveCodeBench, rStar-Coder improves Qwen2.5-7B from 17.4% to an impressive 57.3%, and Qwen2.5-14B from 23.3% to 62.5%, surpassing o3-mini (low) by3.1%. On the more challenging USA Computing Olympiad, our 7B model achieves an average pass@1 accuracy of 16.15%, outperforming the frontier-level QWQ-32B. Code and the dataset will be released at https://github.com/microsoft/rStar.
中文: rStar-Coder数据集通过提供41.8万个经过验证的竞赛级编程问题、测试用例和详细推理解决方案,显著提升了大型语言模型的代码推理能力,在多个基准测试中实现了突破性性能提升。
English: The rStar-Coder dataset enhances code reasoning in large language models by providing 418K verified competition-level problems with test cases and long-reasoning solutions, significantly boosting performance on benchmarks like LiveCodeBench and USA Computing Olympiad even with smaller models.

Authors:Yao Huang, Yitong Sun, Shouwei Ruan, Yichi Zhang, Yinpeng Dong, Xingxing Wei
Title: Breaking the Ceiling: Exploring the Potential of Jailbreak Attacks through Expanding Strategy Space
Abstract:
Large Language Models (LLMs), despite advanced general capabilities, still suffer from numerous safety risks, especially jailbreak attacks that bypass safety protocols. Understanding these vulnerabilities through black-box jailbreak attacks, which better reflect real-world scenarios, offers critical insights into model robustness. While existing methods have shown improvements through various prompt engineering techniques, their success remains limited against safety-aligned models, overlooking a more fundamental problem: the effectiveness is inherently bounded by the predefined strategy spaces. However, expanding this space presents significant challenges in both systematically capturing essential attack patterns and efficiently navigating the increased complexity. To better explore the potential of expanding the strategy space, we address these challenges through a novel framework that decomposes jailbreak strategies into essential components based on the Elaboration Likelihood Model (ELM) theory and develops genetic-based optimization with intention evaluation mechanisms. Strikingly, our experiments reveal unprecedented jailbreak capabilities by expanding the strategy space: we achieve over 90% success rate on Claude-3.5 where prior methods completely fail, while demonstrating strong cross-model transferability and surpassing specialized safeguard models in evaluation accuracy. The code is open-sourced at: https://github.com/Aries-iai/CL-GSO.
中文: 该研究提出基于精细加工可能性模型和遗传优化的新框架,通过扩展越狱策略空间实现了对Claude-3.5等先进模型超过90%的攻击成功率,同时展现出强大的跨模型迁移能力并在评估准确率上超越专业防护模型。
English: This study introduces a novel framework that expands jailbreak strategy spaces using ELM theory and genetic optimization, achieving over 90% success against advanced models like Claude-3.5 while demonstrating strong transferability and surpassing safeguard models in accuracy.

Authors:Dosung Lee, Wonjun Oh, Boyoung Kim, Minyoung Kim, Joonsuk Park, Paul Hongsuck Seo
Title: ReSCORE: Label-free Iterative Retriever Training for Multi-hop Question Answering with Relevance-Consistency Supervision
Abstract:
Multi-hop question answering (MHQA) involves reasoning across multiple documents to answer complex questions. Dense retrievers typically outperform sparse methods like BM25 by leveraging semantic embeddings; however, they require labeled query-document pairs for fine-tuning. This poses a significant challenge in MHQA due to the high variability of queries (reformulated questions) throughout the reasoning steps. To overcome this limitation, we introduce Retriever Supervision with Consistency and Relevance (ReSCORE), a novel method for training dense retrievers for MHQA without labeled documents. ReSCORE leverages large language models to capture each document's relevance to the question and consistency with the correct answer, and uses them to train a retriever within an iterative question-answering framework. Experiments on three MHQA benchmarks demonstrate the effectiveness of ReSCORE, with significant improvements in retrieval and, in turn, state-of-the-art MHQA performance. Our implementation is available at: https://leeds1219.github.io/ReSCORE.
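The label-free supervision signal can be sketched as follows: a frozen LLM scores each retrieved document by consistency (how probable the gold answer is given the document) and relevance (how probable the question is given the document), and the softmax over these scores becomes the retriever's training target. The `llm_logprob` interface below is our assumption:

```python
import torch
import torch.nn.functional as F

def rescore_targets(llm_logprob, question, answer, docs):
    """Soft pseudo-labels over documents from consistency + relevance,
    where llm_logprob(text, context) returns a frozen LLM's log-prob."""
    scores = torch.tensor([
        llm_logprob(answer, context=doc + "\n" + question)   # consistency
        + llm_logprob(question, context=doc)                 # relevance
        for doc in docs
    ])
    return F.softmax(scores, dim=0)

def retriever_loss(retriever_scores, targets):
    # KL divergence between the retriever's doc distribution and the targets.
    return F.kl_div(F.log_softmax(retriever_scores, dim=0), targets,
                    reduction="sum")
```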

Authors:Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly
Title: Evaluation of LLMs in Medical Text Summarization: The Role of Vocabulary Adaptation in High OOV Settings
Abstract:
Large Language Models (LLMs) recently achieved great success in medical text summarization by simply using in-context learning. However, these recent efforts do not perform fine-grained evaluations under difficult settings where LLMs might fail. They typically report performance scores over the entire dataset. Through our benchmarking study, we show that LLMs show a significant performance drop for data points with high concentration of out-of-vocabulary (OOV) words or with high novelty. Vocabulary adaptation is an intuitive solution to this vocabulary mismatch issue where the LLM vocabulary gets updated with certain expert domain (here, medical) words or subwords. An interesting finding from our study is that Llama-3.1, even with a vocabulary size of around 128K tokens, still faces over-fragmentation issue with medical words. To that end, we show vocabulary adaptation helps improve the LLM summarization performance even in difficult settings. Through extensive experimentation of multiple vocabulary adaptation strategies, two continual pretraining strategies, and three benchmark medical summarization datasets, we gain valuable insights into the role of vocabulary adaptation strategies for customizing LLMs to the medical domain. We also performed a human evaluation study with medical experts where they found that vocabulary adaptation results in more relevant and faithful summaries. Our codebase is made publicly available at https://github.com/gb-kgp/LLM-MedicalSummarization-Benchmark.
中文: 大型语言模型在涉及生僻词或高新颖性的困难场景中处理医学文本摘要时表现不佳,但通过词汇适应策略融入领域专业术语可显著提升其摘要质量。
English: Large Language Models (LLMs) struggle with medical text summarization in challenging scenarios involving out-of-vocabulary words or high novelty, but vocabulary adaptation significantly improves their performance by incorporating domain-specific terms.
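Mechanically, the simplest form of vocabulary adaptation is adding whole-word domain tokens to the tokenizer and resizing the embedding matrix; the paper's contribution lies in comparing adaptation and continual-pretraining strategies on top of this. A minimal sketch with standard Hugging Face calls; the term list is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B"  # model family studied in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder medical terms that base tokenizers tend to over-fragment.
medical_terms = ["pneumothorax", "hepatosplenomegaly", "bronchiectasis"]
num_added = tokenizer.add_tokens(medical_terms)
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; new vocab size {len(tokenizer)}")
# The new embeddings start untrained; continual pretraining on in-domain
# text (as studied in the paper) is what makes them useful.
```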

Authors:Yu He, Zihan Yao, Chentao Song, Tianyu Qi, Jun Liu, Ming Li, Qing Huang
Title: LMCD: Language Models are Zeroshot Cognitive Diagnosis Learners
Abstract:
Cognitive Diagnosis (CD) has become a critical task in AI-empowered education, supporting personalized learning by accurately assessing students' cognitive states. However, traditional CD models often struggle in cold-start scenarios due to the lack of student-exercise interaction data. Recent NLP-based approaches leveraging pre-trained language models (PLMs) have shown promise by utilizing textual features but fail to fully bridge the gap between semantic understanding and cognitive profiling. In this work, we propose Language Models as Zeroshot Cognitive Diagnosis Learners (LMCD), a novel framework designed to handle cold-start challenges by harnessing large language models (LLMs). LMCD operates via two primary phases: (1) Knowledge Diffusion, where LLMs generate enriched contents of exercises and knowledge concepts (KCs), establishing stronger semantic links; and (2) Semantic-Cognitive Fusion, where LLMs employ causal attention mechanisms to integrate textual information and student cognitive states, creating comprehensive profiles for both students and exercises. These representations are efficiently trained with off-the-shelf CD models. Experiments on two real-world datasets demonstrate that LMCD significantly outperforms state-of-the-art methods in both exercise-cold and domain-cold settings. The code is publicly available at https://github.com/TAL-auroraX/LMCD
中文:提出的LMCD框架通过大语言模型增强习题语义并融合认知状态,解决了认知诊断中的冷启动问题,在冷启动场景下表现卓越。
English: The proposed LMCD framework overcomes cold-start challenges in cognitive diagnosis by using large language models to enrich exercise semantics and fuse them with cognitive states, achieving superior performance in cold-start scenarios.

Authors:Jong Hak Moon, Geon Choi, Paloma Rabaey, Min Gwan Kim, Hyuk Gi Hong, Jung-Oh Lee, Hangyul Yoon, Eun Woo Doe, Jiyoun Kim, Harshita Sharma, Daniel C. Castro, Javier Alvarez-Valle, Edward Choi
Title: Lunguage: A Benchmark for Structured and Sequential Chest X-ray Interpretation
Abstract:
Radiology reports convey detailed clinical observations and capture diagnostic reasoning that evolves over time. However, existing evaluation methods are limited to single-report settings and rely on coarse metrics that fail to capture fine-grained clinical semantics and temporal dependencies. We introduce LUNGUAGE,a benchmark dataset for structured radiology report generation that supports both single-report evaluation and longitudinal patient-level assessment across multiple studies. It contains 1,473 annotated chest X-ray reports, each reviewed by experts, and 80 of them contain longitudinal annotations to capture disease progression and inter-study intervals, also reviewed by experts. Using this benchmark, we develop a two-stage framework that transforms generated reports into fine-grained, schema-aligned structured representations, enabling longitudinal interpretation. We also propose LUNGUAGESCORE, an interpretable metric that compares structured outputs at the entity, relation, and attribute level while modeling temporal consistency across patient timelines. These contributions establish the first benchmark dataset, structuring framework, and evaluation metric for sequential radiology reporting, with empirical results demonstrating that LUNGUAGESCORE effectively supports structured report evaluation. The code is available at: https://github.com/SuperSupermoon/Lunguage
中文: 我们推出了LUNGUAGE,这是一个用于结构化放射学报告生成的基准数据集和框架,支持细粒度和纵向评估,同时提出了可解释的LUNGUAGESCORE指标,用于评估临床语义和时间一致性。
English: We introduce LUNGUAGE, a benchmark dataset and framework for structured radiology report generation that enables fine-grained and longitudinal evaluation, along with LUNGUAGESCORE, an interpretable metric for assessing clinical semantics and temporal consistency.

Authors:Wei Chen, Zhao Zhang, Meng Yuan, Kepeng Xu, Fuzhen Zhuang
Title: FCKT: Fine-Grained Cross-Task Knowledge Transfer with Semantic Contrastive Learning for Targeted Sentiment Analysis
Abstract:
In this paper, we address the task of targeted sentiment analysis (TSA), which involves two sub-tasks, i.e., identifying specific aspects from reviews and determining their corresponding sentiments. Aspect extraction forms the foundation for sentiment prediction, highlighting the critical dependency between these two tasks for effective cross-task knowledge transfer. While most existing studies adopt a multi-task learning paradigm to align task-specific features in the latent space, they predominantly rely on coarse-grained knowledge transfer. Such approaches lack fine-grained control over aspect-sentiment relationships, often assuming uniform sentiment polarity within related aspects. This oversimplification neglects contextual cues that differentiate sentiments, leading to negative transfer. To overcome these limitations, we propose FCKT, a fine-grained cross-task knowledge transfer framework tailored for TSA. By explicitly incorporating aspect-level information into sentiment prediction, FCKT achieves fine-grained knowledge transfer, effectively mitigating negative transfer and enhancing task performance. Experiments on three datasets, including comparisons with various baselines and large language models (LLMs), demonstrate the effectiveness of FCKT. The source code is available on https://github.com/cwei01/FCKT.
中文摘要:本文提出FCKT框架,通过细粒度的跨任务知识迁移,在目标情感分析中结合方面级信息优化情感预测,有效缓解负迁移问题并提升任务性能。
English Summary: This paper introduces FCKT, a fine-grained cross-task knowledge transfer framework for targeted sentiment analysis that addresses limitations in existing methods by incorporating aspect-level information to improve sentiment prediction and reduce negative transfer.

Authors:Cainan Davidson, Deva Ramanan, Neehar Peri
Title: RefAV: Towards Planning-Centric Scenario Mining
Abstract:
Autonomous Vehicles (AVs) collect and pseudo-label terabytes of multi-modal data localized to HD maps during normal fleet testing. However, identifying interesting and safety-critical scenarios from uncurated driving logs remains a significant challenge. Traditional scenario mining techniques are error-prone and prohibitively time-consuming, often relying on hand-crafted structured queries. In this work, we revisit spatio-temporal scenario mining through the lens of recent vision-language models (VLMs) to detect whether a described scenario occurs in a driving log and, if so, precisely localize it in both time and space. To address this problem, we introduce RefAV, a large-scale dataset of 10,000 diverse natural language queries that describe complex multi-agent interactions relevant to motion planning derived from 1000 driving logs in the Argoverse 2 Sensor dataset. We evaluate several referential multi-object trackers and present an empirical analysis of our baselines. Notably, we find that naively repurposing off-the-shelf VLMs yields poor performance, suggesting that scenario mining presents unique challenges. Lastly, we discuss our recent CVPR 2025 competition and share insights from the community. Our code and dataset are available at https://github.com/CainanD/RefAV/ and https://argoverse.github.io/user-guide/tasks/scenario_mining.html
中文: 本研究提出RefAV大规模数据集,包含源自Argoverse 2传感器数据集1000条驾驶日志的一万条自然语言查询,用于从自动驾驶日志中挖掘安全关键场景;实证分析表明,直接套用现成的视觉语言模型效果不佳,凸显了场景挖掘任务的独特挑战。
English: This study introduces RefAV, a large-scale dataset of 10,000 natural language queries derived from 1,000 driving logs in the Argoverse 2 Sensor dataset for mining safety-critical scenarios, and shows that naively repurposing off-the-shelf vision-language models performs poorly, highlighting the unique challenges of scenario mining.

Authors:Zhibo Wang, Xiaoze Jiang, Zhiheng Qin, Enyun Yu, Han Li
Title: Personalized Query Auto-Completion for Long and Short-Term Interests with Adaptive Detoxification Generation
Abstract:
Query auto-completion (QAC) plays a crucial role in modern search systems. However, in real-world applications, there are two pressing challenges that still need to be addressed. First, there is a need for hierarchical personalized representations for users. Previous approaches have typically used users' search behavior as a single, overall representation, which proves inadequate in more nuanced generative scenarios. Additionally, query prefixes are typically short and may contain typos or sensitive information, increasing the likelihood of generating toxic content compared to traditional text generation tasks. Such toxic content can degrade user experience and lead to public relations issues. Therefore, the second critical challenge is detoxifying QAC systems. To address these two limitations, we propose a novel model (LaD) that captures personalized information from both long-term and short-term interests, incorporating adaptive detoxification. In LaD, personalized information is captured hierarchically at both coarse-grained and fine-grained levels. This approach preserves as much personalized information as possible while enabling online generation within time constraints. Taking a further step, we propose an online training method based on Reject Preference Optimization (RPO). By incorporating a special token [Reject] during both the training and inference processes, the model achieves adaptive detoxification. Consequently, the generated text presented to users is both non-toxic and relevant to the given prefix. We conduct comprehensive experiments on industrial-scale datasets and perform online A/B tests, delivering the largest single-experiment metric improvement in nearly two years of our product. Our model has been deployed on Kuaishou search, driving the primary traffic for hundreds of millions of active users. The code is available at https://github.com/JXZe/LaD.
中文: LaD模型通过分层捕捉用户长短期兴趣并采用基于拒绝偏好优化的自适应去毒方法,解决了查询自动补全中的个性化表示和内容去毒两大挑战,在实际应用中取得了显著成效。
English: The LaD model addresses hierarchical personalization and detoxification in query auto-completion by capturing users' long-term and short-term interests and incorporating adaptive detoxification through Reject Preference Optimization, achieving significant improvements in real-world deployment.
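The adaptive detoxification mechanism can be sketched at inference time: with a special [Reject] token learned during RPO training, the model effectively scores whether a completion should be withheld, and candidates it rejects are filtered out. The threshold, batch handling, and scoring position below are illustrative assumptions:

```python
import torch

@torch.no_grad()
def filter_toxic_completions(model, tokenizer, prefix_ids, candidates, tau=0.5):
    """Keep only completion candidates for which the model's probability of
    emitting the special [Reject] token (learned via RPO) stays below tau."""
    reject_id = tokenizer.convert_tokens_to_ids("[Reject]")
    kept = []
    for cand_ids in candidates:
        logits = model(torch.cat([prefix_ids, cand_ids], dim=-1)).logits
        # Distribution at the position right after the prefix.
        probs = torch.softmax(logits[0, prefix_ids.size(-1) - 1], dim=-1)
        if probs[reject_id].item() < tau:
            kept.append(cand_ids)
    return kept
```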

Authors:Pingrui Zhang, Yifei Su, Pengyuan Wu, Dong An, Li Zhang, Zhigang Wang, Dong Wang, Yan Ding, Bin Zhao, Xuelong Li
Title: Cross from Left to Right Brain: Adaptive Text Dreamer for Vision-and-Language Navigation
Abstract:
Vision-and-Language Navigation (VLN) requires the agent to navigate by following natural instructions under partial observability, making it difficult to align perception with language. Recent methods mitigate this by imagining future scenes, yet they rely on vision-based synthesis, leading to high computational cost and redundant details. To this end, we propose to adaptively imagine key environmental semantics in language form, enabling a more reliable and efficient strategy. Specifically, we introduce a novel Adaptive Text Dreamer (ATD), a dual-branch self-guided imagination policy built upon a large language model (LLM). ATD is designed with a human-like left-right brain architecture, where the left brain focuses on logical integration, and the right brain is responsible for imaginative prediction of future scenes. To achieve this, we fine-tune only the Q-former within both brains to efficiently activate domain-specific knowledge in the LLM, enabling dynamic updates of logical reasoning and imagination during navigation. Furthermore, we introduce a cross-interaction mechanism to regularize the imagined outputs and inject them into a navigation expert module, allowing ATD to jointly exploit both the reasoning capacity of the LLM and the expertise of the navigation model. We conduct extensive experiments on the R2R benchmark, where ATD achieves state-of-the-art performance with fewer parameters. The code is available at https://github.com/zhangpingrui/Adaptive-Text-Dreamer.
中文: 提出的自适应文本想象器(ATD)采用双分支大语言模型架构,通过语言形式想象关键环境语义,以更低计算成本和参数量实现了领先的导航性能。
English: The proposed Adaptive Text Dreamer (ATD) uses a dual-branch LLM architecture to imagine future environmental semantics through language, reducing computational costs while achieving state-of-the-art navigation performance with fewer parameters.

Authors:Jiyoung Lee, Seungho Kim, Jieun Han, Jun-Min Lee, Kitaek Kim, Alice Oh, Edward Choi
Title: Trans-EnV: A Framework for Evaluating the Linguistic Robustness of LLMs Against English Varieties
Abstract:
Large Language Models (LLMs) are predominantly evaluated on Standard American English (SAE), often overlooking the diversity of global English varieties. This narrow focus may raise fairness concerns as degraded performance on non-standard varieties can lead to unequal benefits for users worldwide. Therefore, it is critical to extensively evaluate the linguistic robustness of LLMs on multiple non-standard English varieties. We introduce Trans-EnV, a framework that automatically transforms SAE datasets into multiple English varieties to evaluate the linguistic robustness. Our framework combines (1) linguistics expert knowledge to curate variety-specific features and transformation guidelines from linguistic literature and corpora, and (2) LLM-based transformations to ensure both linguistic validity and scalability. Using Trans-EnV, we transform six benchmark datasets into 38 English varieties and evaluate seven state-of-the-art LLMs. Our results reveal significant performance disparities, with accuracy decreasing by up to 46.3% on non-standard varieties. These findings highlight the importance of comprehensive linguistic robustness evaluation across diverse English varieties. Each construction of Trans-EnV was validated through rigorous statistical testing and consultation with a researcher in the field of second language acquisition, ensuring its linguistic validity. Our code and datasets are publicly available at https://github.com/jiyounglee-0523/TransEnV and https://huggingface.co/collections/jiyounglee0523/transenv-681eadb3c0c8cf363b363fb1.
中文摘要:当前大语言模型评估主要针对标准美式英语,可能引发公平性问题,因此开发了Trans-EnV框架来自动将数据集转换为38种英语变体,发现非标准变体性能下降高达46.3%。
English Summary: Current LLM evaluations primarily focus on Standard American English, potentially causing fairness issues, so the Trans-EnV framework was developed to automatically transform datasets into 38 English varieties, revealing performance drops of up to 46.3% in non-standard varieties.

Authors:Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Title: AdParaphrase v2.0: Generating Attractive Ad Texts Using a Preference-Annotated Paraphrase Dataset
Abstract:
Identifying factors that make ad text attractive is essential for advertising success. This study proposes AdParaphrase v2.0, a dataset for ad text paraphrasing, containing human preference data, to enable the analysis of the linguistic factors and to support the development of methods for generating attractive ad texts. Compared with v1.0, this dataset is 20 times larger, comprising 16,460 ad text paraphrase pairs, each annotated with preference data from ten evaluators, thereby enabling a more comprehensive and reliable analysis. Through the experiments, we identified multiple linguistic features of engaging ad texts that were not observed in v1.0 and explored various methods for generating attractive ad texts. Furthermore, our analysis demonstrated the relationships between human preference and ad performance, and highlighted the potential of reference-free metrics based on large language models for evaluating ad text attractiveness. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase-v2.0.
中文摘要:本研究推出的AdParaphrase v2.0数据集通过大规模人工偏好标注,能够识别吸引人的广告文本语言特征,并探索了广告文本生成方法与评估指标的有效性。
English Summary: This study introduces AdParaphrase v2.0, a significantly expanded dataset with human preference annotations that enables identification of linguistic features for creating engaging ad texts and explores methods for their generation and evaluation.

Authors:Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, Haifeng Wang
Title: Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation
Abstract:
Long-form question answering (LFQA) presents unique challenges for large language models, requiring the synthesis of coherent, paragraph-length answers. While retrieval-augmented generation (RAG) systems have emerged as a promising solution, existing research struggles with key limitations: the scarcity of high-quality training data for long-form generation, the compounding risk of hallucination in extended outputs, and the absence of reliable evaluation metrics for factual completeness. In this paper, we propose RioRAG, a novel reinforcement learning (RL) framework that advances long-form RAG through reinforced informativeness optimization. Our approach introduces two fundamental innovations to address the core challenges. First, we develop an RL training paradigm of reinforced informativeness optimization that directly optimizes informativeness and effectively addresses the slow-thinking deficit in conventional RAG systems, bypassing the need for expensive supervised data. Second, we propose a nugget-centric hierarchical reward modeling approach that enables precise assessment of long-form answers through a three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment. Extensive experiments on two LFQA benchmarks LongFact and RAGChecker demonstrate the effectiveness of the proposed method. Our codes are available at https://github.com/RUCAIBox/RioRAG.
中文: RioRAG框架通过强化学习优化信息丰富性和采用以要点为中心的分层奖励模型,提升了长形式问答的事实准确性,有效解决了检索增强生成系统中数据稀缺和评估困难等核心问题。
English: The RioRAG framework introduces a reinforcement learning approach with reinforced informativeness optimization and a nugget-centric hierarchical reward model to enhance long-form question answering by improving factual accuracy and addressing data scarcity and evaluation challenges in retrieval-augmented generation systems.
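The final stage of the nugget-centric reward is a straightforward checklist score: the reward equals the fraction of nugget claims the generated answer supports. A sketch where `entails(claim, text)` stands in for any NLI-style checker (the nugget extraction and checklist-construction stages are omitted):

```python
def nugget_reward(entails, answer, nugget_checklist):
    """Fraction of checklist claims factually supported by the answer."""
    if not nugget_checklist:
        return 0.0
    supported = sum(bool(entails(claim, answer)) for claim in nugget_checklist)
    return supported / len(nugget_checklist)
```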

Authors:Noy Sternlicht, Tom Hope
Title: CHIMERA: A Knowledge Base of Scientific Idea Recombinations for Research Analysis and Ideation
Abstract:
A hallmark of human innovation is recombination -- the creation of novel ideas by integrating elements from existing concepts and mechanisms. In this work, we introduce CHIMERA, a large-scale Knowledge Base (KB) of over 28K recombination examples automatically mined from the scientific literature. CHIMERA enables large-scale empirical analysis of how scientists recombine concepts and draw inspiration from different areas, and enables training models that propose novel, cross-disciplinary research directions. To construct this KB, we define a new information extraction task: identifying recombination instances in scientific abstracts. We curate a high-quality, expert-annotated dataset and use it to fine-tune a large language model, which we apply to a broad corpus of AI papers. We showcase the utility of CHIMERA through two applications. First, we analyze patterns of recombination across AI subfields. Second, we train a scientific hypothesis generation model using the KB, showing that it can propose novel research directions that researchers rate as inspiring. We release our data and code at https://github.com/noy-sternlicht/CHIMERA-KB.
Chinese: 本文介绍了CHIMERA,一个从科学文献中自动挖掘超过2.8万个重组案例的大规模知识库,可用于分析跨学科创新模式并训练生成新颖研究方向的人工智能模型。
English: This paper introduces CHIMERA, a large-scale Knowledge Base of over 28K recombination examples mined from scientific literature, enabling analysis of cross-disciplinary innovation and training models for novel research direction proposals.

Authors:Jungyoub Cha, Hyunjong Kim, Sungzoon Cho
Title: SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
Abstract:
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), but its performance degrades on long inputs due to increased attention cost and reduced draft accuracy. We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences without any additional training. First, SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models. To improve draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that uses the target model's attention scores to dynamically select relevant context for the draft model. Extensive evaluations on three long-context understanding datasets show that SpecExtend accelerates standard tree-based speculative decoding by up to 2.22x for inputs up to 16K tokens, providing an effective solution for speculative decoding of long sequences. Our code is available at https://github.com/jycha98/SpecExtend .
中文: 推测解码在长输入下性能下降,而SpecExtend通过高效注意力机制和新型KV缓存淘汰策略,无需额外训练即可将长序列的树形推测解码最高加速2.22倍。
English: Speculative decoding performance declines with longer inputs, but SpecExtend enhances it for long sequences using efficient attention and a novel KV cache eviction strategy, achieving up to 2.22x speedup without extra training.
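Cross-model Retrieval reduces to an eviction rule: rank cached positions by the target model's attention scores and hand the draft model only the top entries. A sketch under assumed shapes (attention aggregated to one score per position, KV cache as per-layer (key, value) tensors indexed along the sequence dimension):

```python
import torch

def cross_model_retrieval(target_attn, kv_cache, keep=1024):
    """Keep the `keep` cached positions the target model attends to most,
    preserving their original order, as the draft model's context."""
    keep = min(keep, target_attn.size(0))
    idx = torch.topk(target_attn, keep).indices.sort().values
    return [(k[idx], v[idx]) for k, v in kv_cache]
```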

Authors:Xiaqiang Tang, Jian Li, Keyu Hu, Du Nan, Xiaolong Li, Xi Zhang, Weigao Sun, Sihong Xie
Title: CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
Abstract:
Faithfulness hallucinations are claims generated by a Large Language Model (LLM) not supported by contexts provided to the LLM. Lacking assessment standards, existing benchmarks focus on "factual statements" that rephrase source materials while overlooking "cognitive statements" that involve making inferences from the given context. Consequently, evaluating and detecting the hallucination of cognitive statements remains challenging. Inspired by how evidence is assessed in the legal domain, we design a rigorous framework to assess different levels of faithfulness of cognitive statements and introduce the CogniBench dataset where we reveal insightful statistics. To keep pace with rapidly evolving LLMs, we further develop an automatic annotation pipeline that scales easily across different models. This results in a large-scale CogniBench-L dataset, which facilitates training accurate detectors for both factual and cognitive hallucinations. We release our model and datasets at: https://github.com/FUTUREEEEEE/CogniBench
中文摘要:本研究提出了CogniBench框架和数据集,专门评估大型语言模型中对认知陈述的忠实度幻觉,并通过自动化流程生成大规模训练数据,以提升事实性和认知性幻觉的检测能力。
English Summary: The study introduces CogniBench, a framework and dataset for evaluating faithfulness hallucinations in LLMs, particularly focusing on cognitive statements, and develops an automated pipeline to create large-scale training data for detecting both factual and cognitive inaccuracies.

Authors:Hanlin Wang, Chak Tou Leong, Jiashuo Wang, Jian Wang, Wenjie Li
Title: SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution
Abstract:
Reinforcement learning (RL) holds significant promise for training LLM agents to handle complex, goal-oriented tasks that require multi-step interactions with external environments. However, a critical challenge when applying RL to these agentic tasks arises from delayed rewards: feedback signals are typically available only after the entire task is completed. This makes it non-trivial to assign delayed rewards to earlier actions, providing insufficient guidance regarding environmental constraints and hindering agent training. In this work, we draw on the insight that the ultimate completion of a task emerges from the cumulative progress an agent makes across individual steps. We propose Stepwise Progress Attribution (SPA), a general reward redistribution framework that decomposes the final reward into stepwise contributions, each reflecting its incremental progress toward overall task completion. To achieve this, we train a progress estimator that accumulates stepwise contributions over a trajectory to match the task completion. During policy optimization, we combine the estimated per-step contribution with a grounding signal for actions executed in the environment as the fine-grained, intermediate reward for effective agent training. Extensive experiments on common agent benchmarks (including Webshop, ALFWorld, and VirtualHome) demonstrate that SPA consistently outperforms the state-of-the-art method in both success rate (+2.5% on average) and grounding accuracy (+1.9% on average). Further analyses demonstrate that our method remarkably provides more effective intermediate rewards for RL training. Our code is available at https://github.com/WangHanLinHenry/SPA-RL-Agent.
中文摘要:强化学习在训练LLM智能体时面临延迟奖励的挑战,而提出的逐步进展归因(SPA)框架通过将最终奖励分解为逐步贡献来解决这一问题,有效提升训练效果和性能表现。
English Summary: Reinforcement learning faces delayed reward challenges in training LLM agents, which the proposed Stepwise Progress Attribution (SPA) framework addresses by decomposing final rewards into stepwise contributions to improve training effectiveness and performance.
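
The core mechanism lends itself to a short sketch. Below is a minimal, illustrative Python version of SPA-style reward redistribution: a small progress estimator is trained so that its per-step contributions sum to the delayed task reward, and the estimated contributions (plus a grounding bonus) then serve as fine-grained intermediate rewards. The feature dimension, network size, and grounding signal are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ProgressEstimator(nn.Module):
    """Maps a per-step feature vector to a scalar progress contribution."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, step_feats: torch.Tensor) -> torch.Tensor:
        return self.net(step_feats).squeeze(-1)  # (T,)

est = ProgressEstimator(feat_dim=16)
opt = torch.optim.Adam(est.parameters(), lr=1e-3)

# One toy trajectory: T = 10 steps and a single delayed task reward.
steps, final_reward = torch.randn(10, 16), torch.tensor(1.0)
for _ in range(200):
    contrib = est(steps)                          # per-step contributions
    loss = (contrib.sum() - final_reward) ** 2    # cumulative progress ~ reward
    opt.zero_grad(); loss.backward(); opt.step()

grounding = torch.full((10,), 0.1)  # stand-in for an action-grounding signal
stepwise_reward = est(steps).detach() + grounding  # intermediate RL reward
```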

Authors:Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chenliang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, Fei Huang, Yang Liu
Title: MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
Abstract:
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. Despite recent advances in general video understanding, current MLLMs still struggle with fine-grained temporal reasoning. While reinforcement learning (RL) has been explored to address this issue recently, existing RL approaches remain limited in effectiveness. In this work, we propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. MUSEG enables MLLMs to align queries with multiple relevant video segments, promoting more comprehensive temporal reasoning. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning. Extensive experiments on temporal grounding and time-sensitive video QA tasks demonstrate that MUSEG significantly outperforms existing methods and generalizes well across diverse temporal understanding scenarios. View our project at https://github.com/THUNLP-MT/MUSEG.
中文:MUSEG是一种基于强化学习的新方法,通过引入时间戳感知的多片段定位和分阶段奖励训练,显著提升了多模态大语言模型的时序理解能力,在时序推理任务中明显优于现有方法。
English: MUSEG is a novel reinforcement learning method that enhances multimodal large language models' temporal understanding by enabling timestamp-aware multi-segment grounding and phased reward training, significantly outperforming existing approaches in temporal reasoning tasks.

Authors:Danush Khanna, Pratinav Seth, Sidhaarth Sredharan Murali, Aditya Kumar Guru, Siddharth Shukla, Tanuj Tyagi, Sandeep Chaurasia, Kripabandhu Ghosh
Title: SELF-PERCEPT: Introspection Improves Large Language Models' Detection of Multi-Person Mental Manipulation in Conversations
Abstract:
Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation's nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at https://github.com/danushkhanna/self-percept.
中文: 本研究提出了MultiManip数据集和SELF-PERCEPT框架,以解决大型语言模型在多轮对话中检测微妙心理操纵的难题,相比现有模型展现出显著性能提升。
English: The study introduces the MultiManip dataset and SELF-PERCEPT framework to address LLMs' challenges in detecting nuanced mental manipulation in multi-turn dialogues, showing significant improvement over existing models.

Authors:Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li, Ziwei He, Xinbing Wang, Zhiyu Li, Zhouhan Lin
Title: Pretraining Language Models to Ponder in Continuous Space
Abstract:
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort. In this work, we introduce this pondering process into language models by repeatedly invoking the forward process within a single token generation step. During pondering, instead of generating an actual token sampled from the prediction distribution, the model ponders by yielding a weighted sum of all token embeddings according to the predicted token distribution. The generated embedding is then fed back as input for another forward pass. We show that the model can learn to ponder in this way through self-supervised learning, without any human annotations. Experiments across three widely used open-source architectures-GPT-2, Pythia, and LLaMA-and extensive downstream task evaluations demonstrate the effectiveness and generality of our method. For language modeling tasks, pondering language models achieve performance comparable to vanilla models with twice the number of parameters. On 9 downstream benchmarks, our pondering-enhanced Pythia models significantly outperform the official Pythia models. Notably, PonderingPythia-2.8B surpasses Pythia-6.9B, and PonderingPythia-1B is comparable to TinyLlama-1.1B, which is trained on 10 times more data. The code is available at https://github.com/LUMIA-Group/PonderingLM.
中文摘要:本研究通过单次生成步骤中迭代处理词嵌入的方式,为语言模型引入“深思”机制,使模型能通过自监督学习以更少参数实现与大型模型相当的性能。
English Summary: This study introduces a "pondering" mechanism into language models by iteratively processing token embeddings within a single generation step, enabling models to achieve performance comparable to larger counterparts with fewer parameters through self-supervised learning.
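
The pondering step itself is compact enough to sketch. The toy snippet below (with an arbitrary vocabulary and embedding size, not the paper's models) shows the key move: instead of sampling a token, the model feeds back the probability-weighted mixture of all token embeddings.

```python
import torch

vocab_size, d_model = 100, 32
embedding = torch.nn.Embedding(vocab_size, d_model)

def ponder_step(logits: torch.Tensor) -> torch.Tensor:
    # No discrete token is emitted: the predicted distribution weights a
    # mixture over the whole embedding table, which becomes the next input.
    probs = torch.softmax(logits, dim=-1)   # (batch, vocab)
    return probs @ embedding.weight         # (batch, d_model)

logits = torch.randn(2, vocab_size)
pondered = ponder_step(logits)  # fed back for another forward pass
print(pondered.shape)           # torch.Size([2, 32])
```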

Authors:Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri
Title: Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models
Abstract:
Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 16.8 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl/ and https://universe.roboflow.com/rf100-vl/
中文: 视觉语言模型在常见物体检测上表现出色,但难以泛化到分布外概念,为此我们提出Roboflow100-VL基准,通过多模态指令实现少样本概念对齐以解决这一局限。
English: Vision-language models excel at detecting common objects but struggle with out-of-distribution concepts, prompting the introduction of Roboflow100-VL for few-shot alignment using multimodal instructions to address this limitation.

Authors:Shenao Zhang, Yaqing Wang, Yinxiao Liu, Tianqi Liu, Peter Grabowski, Eugene Ie, Zhaoran Wang, Yunxuan Li
Title: Beyond Markovian: Reflective Exploration via Bayes-Adaptive RL for LLM Reasoning
Abstract:
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have exhibited strong reasoning capabilities and emergent reflective behaviors, such as backtracking and error correction. However, conventional Markovian RL confines exploration to the training phase to learn an optimal deterministic policy and depends on the history context only through the current state. Therefore, it remains unclear whether reflective reasoning will emerge during Markovian RL training, or why it is beneficial at test time. To remedy this, we recast reflective exploration within the Bayes-Adaptive RL framework, which explicitly optimizes the expected return under a posterior distribution over Markov decision processes. This Bayesian formulation inherently incentivizes both reward-maximizing exploitation and information-gathering exploration via belief updates. Our resulting algorithm, BARL, instructs the LLM to stitch and switch strategies based on the observed outcomes, offering principled guidance on when and how the model should reflectively explore. Empirical results on both synthetic and mathematical reasoning tasks demonstrate that BARL outperforms standard Markovian RL approaches at test time, achieving superior token efficiency with improved exploration effectiveness. Our code is available at https://github.com/shenao-zhang/BARL.
Chinese Summary: 本研究提出BARL这一贝叶斯强化学习框架,通过增强大语言模型的反思性探索与策略适应能力,在推理任务中实现了优于传统方法的性能与更高的效率。
English Summary: The study introduces BARL, a Bayesian reinforcement learning framework that enhances large language models' reflective exploration and strategy adaptation, outperforming standard methods in reasoning tasks with greater efficiency.

Authors:Zihong Chen, Wanli Jiang, Jinzhe Li, Zhonghang Yuan, Huanjun Kong, Wanli Ouyang, Nanqing Dong
Title: GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
Abstract:
Fine-tuning for large language models (LLMs) typically requires substantial amounts of high-quality supervised data, which is both costly and labor-intensive to acquire. While synthetic data generation has emerged as a promising solution, existing approaches frequently suffer from factual inaccuracies, insufficient long-tail coverage, simplistic knowledge structures, and homogenized outputs. To address these challenges, we introduce GraphGen, a knowledge graph-guided framework designed for three key question-answering (QA) scenarios: atomic QA, aggregated QA, and multi-hop QA. It begins by constructing a fine-grained knowledge graph from the source text. It then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge. Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data. Experimental results on knowledge-intensive tasks under closed-book settings demonstrate that GraphGen outperforms conventional synthetic data methods, offering a more reliable and comprehensive solution to the data scarcity challenge in supervised fine-tuning. The code and data are publicly available at https://github.com/open-sciencelab/GraphGen.
中文摘要:GraphGen是一个知识图谱引导的框架,通过识别大语言模型的知识缺口并采用多跳采样技术生成高质量问答数据,在解决监督微调数据稀缺问题上优于传统合成方法。
English Summary: GraphGen is a knowledge graph-guided framework that generates high-quality synthetic question-answering data by identifying knowledge gaps in LLMs and incorporating multi-hop sampling, outperforming conventional methods in addressing data scarcity for fine-tuning.
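
Since the abstract names expected calibration error (ECE) as the knowledge-gap signal, a standard ECE computation is sketched below; the binning scheme is a common convention, not necessarily GraphGen's exact recipe.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Average |accuracy - confidence| over confidence bins, weighted by bin size."""
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# A knowledge area where the model is confidently wrong yields a high ECE,
# flagging it as a priority target for synthetic QA generation.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.4], [1, 0, 1, 0]))
```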

Authors:Jaeyoung Choe, Jihoon Kim, Woohwan Jung
Title: Hierarchical Retrieval with Evidence Curation for Open-Domain Financial Question Answering on Standardized Documents
Abstract:
Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity forces traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach first performs hierarchical retrieval to reduce confusion among similar texts: it retrieves related documents and then selects the most relevant passages from them. The evidence curation process removes irrelevant passages. When necessary, it automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
中文摘要:针对金融领域标准化文档重复内容导致检索精度下降的问题,提出HiREC框架,通过分层检索与证据筛选机制有效消除近似重复文本,并构建LOFin基准验证其优越性。
English Summary: The HiREC framework is introduced to enhance retrieval-augmented generation in finance by using hierarchical retrieval and evidence curation to eliminate duplicate and irrelevant text, improving accuracy in processing standardized documents.
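
A schematic version of the two-stage retrieve-then-curate loop is shown below; the scoring functions, thresholds, and cutoffs are placeholders rather than the paper's trained components.

```python
def hierarchical_retrieve(query, documents, doc_score, passage_score,
                          k_docs=5, k_passages=8, min_relevance=0.5):
    # Stage 1: retrieve whole documents to disambiguate near-duplicate text.
    top_docs = sorted(documents, key=lambda d: doc_score(query, d),
                      reverse=True)[:k_docs]
    # Stage 2: select the most relevant passages from those documents only.
    passages = [p for d in top_docs for p in d["passages"]]
    ranked = sorted(passages, key=lambda p: passage_score(query, p),
                    reverse=True)[:k_passages]
    # Evidence curation: drop passages below a relevance floor. When evidence
    # is still incomplete, the full system issues a complementary query here.
    return [p for p in ranked if passage_score(query, p) >= min_relevance]
```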

Authors:Juntong Wu, Zijing Liu, He Cao, Hao Li, Bin Feng, Zishan Shu, Ke Yu, Li Yuan, Yu Li
Title: Rethinking Text-based Protein Understanding: Retrieval or LLM?
Abstract:
In recent years, protein-text models have gained significant attention for their potential in protein generation and understanding. Current approaches focus on integrating protein-related knowledge into large language models through continued pretraining and multi-modal alignment, enabling simultaneous comprehension of textual descriptions and protein sequences. Through a thorough analysis of existing model architectures and text-based protein understanding benchmarks, we identify significant data leakage issues present in current benchmarks. Moreover, conventional metrics derived from natural language processing fail to accurately assess the model's performance in this domain. To address these limitations, we reorganize existing datasets and introduce a novel evaluation framework based on biological entities. Motivated by our observation, we propose a retrieval-enhanced method, which significantly outperforms fine-tuned LLMs for protein-to-text generation and shows accuracy and efficiency in training-free scenarios. Our code and data can be seen at https://github.com/IDEA-XL/RAPM.
中文: 该研究揭示了当前蛋白质-文本模型存在数据泄露和评估指标不足的问题,并提出了一种检索增强方法,在无需训练的情况下显著优于微调大语言模型,实现了更高的准确性和效率。
English: The study identifies data leakage and inadequate evaluation metrics in current protein-text models, proposing a retrieval-enhanced method that surpasses fine-tuned LLMs in protein-to-text generation with improved accuracy and efficiency.

Authors:Wenkai Fang, Shunyu Liu, Yang Zhou, Kongcheng Zhang, Tongya Zheng, Kaixuan Chen, Mingli Song, Dacheng Tao
Title: SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data
Abstract:
Recent advances have demonstrated the effectiveness of Reinforcement Learning (RL) in improving the reasoning capabilities of Large Language Models (LLMs). However, existing works inevitably rely on high-quality instructions and verifiable rewards for effective training, both of which are often difficult to obtain in specialized domains. In this paper, we propose Self-play Reinforcement Learning(SeRL) to bootstrap LLM training with limited initial data. Specifically, SeRL comprises two complementary modules: self-instruction and self-rewarding. The former module generates additional instructions based on the available data at each training step, employing robust online filtering strategies to ensure instruction quality, diversity, and difficulty. The latter module introduces a simple yet effective majority-voting mechanism to estimate response rewards for additional instructions, eliminating the need for external annotations. Finally, SeRL performs conventional RL based on the generated data, facilitating iterative self-play learning. Extensive experiments on various reasoning benchmarks and across different LLM backbones demonstrate that the proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards. Our code is available at https://github.com/wantbook-book/SeRL.
中文: 本文提出自对弈强化学习(SeRL)方法,通过自我生成指令和奖励机制,使大语言模型能够在缺乏外部高质量数据的情况下实现有效训练,在多项推理基准测试中取得了优于同类方法的性能表现。
English: This paper introduces Self-play Reinforcement Learning (SeRL), a method that enables large language models to generate their own instructions and rewards for effective training without relying on external high-quality data, achieving superior reasoning performance across various benchmarks.
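
The self-rewarding half of SeRL is simple enough to show directly: sample several answers for a self-generated instruction and reward agreement with the majority answer. The snippet below is a minimal illustration of that voting mechanism, not the released training code.

```python
from collections import Counter

def majority_vote_rewards(sampled_answers):
    """Reward 1.0 for answers matching the modal answer, 0.0 otherwise."""
    majority, _ = Counter(sampled_answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in sampled_answers]

# Four sampled responses to one self-generated instruction:
print(majority_vote_rewards(["42", "42", "41", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```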

Authors:Rui Liu, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li
Title: Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset
Abstract:
Text-based speech editing (TSE) modifies speech using only text, eliminating re-recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistencies introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCorrector leverages Retrieval-Augmented Generation (RAG) by extracting the edited text's emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker's identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we introduce the first benchmark for this setting, the Emotion Correction Dataset for TSE (ECD-TSE). A prominent aspect of ECD-TSE is its inclusion of <text, speech> paired data featuring diverse text variations and a range of emotional expressions. Subjective and objective experiments and comprehensive analysis on ECD-TSE confirm that EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods. Code and audio examples are available at https://github.com/AI-S2-Lab/EmoCorrector.
中文: EmoCorrector采用检索增强生成技术,通过提取文本情感特征并匹配情感语音样本,在保持说话人特征和音质的同时,有效解决了文本语音编辑中的情感不一致问题。
English: EmoCorrector introduces a post-correction scheme using Retrieval-Augmented Generation to address emotional inconsistencies in text-based speech editing by aligning synthesized speech with desired emotions while preserving speaker identity and quality.

Authors:Mathew J. Koretsky, Maya Willey, Adi Asija, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak, Nicole Kuznetsov, Sungwon Kim, Mike A. Nalls, Daniel Khashabi, Faraz Faghri
Title: BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Abstract:
Biomedical researchers increasingly rely on large-scale structured databases for complex analytical tasks. However, current text-to-SQL systems often struggle to map qualitative scientific questions into executable SQL, particularly when implicit domain reasoning is required. We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base. BiomedSQL comprises 68,000 question/SQL query/answer triples grounded in a harmonized BigQuery knowledge base that integrates gene-disease associations, causal inference from omics data, and drug approval records. Each question requires models to infer domain-specific criteria, such as genome-wide significance thresholds, effect directionality, or trial phase filtering, rather than rely on syntactic translation alone. We evaluate a range of open- and closed-source LLMs across prompting strategies and interaction paradigms. Our results reveal a substantial performance gap: GPT-o3-mini achieves 59.0% execution accuracy, while our custom multi-step agent, BMSQL, reaches 62.6%, both well below the expert baseline of 90.0%. BiomedSQL provides a new foundation for advancing text-to-SQL systems capable of supporting scientific discovery through robust reasoning over structured biomedical knowledge bases. Our dataset is publicly available at https://huggingface.co/datasets/NIH-CARD/BiomedSQL, and our code is open-source at https://github.com/NIH-CARD/biomedsql.
中文: BiomedSQL是首个专门评估生物医学知识库中文本转SQL系统科学推理能力的基准,结果显示现有模型性能与专家水平存在显著差距,为提升结构化数据推理支持科学发现奠定了基础。
English: BiomedSQL is a new benchmark designed to assess scientific reasoning in text-to-SQL systems for biomedical databases, revealing a significant performance gap where even the best models fall well below expert accuracy.

Authors:Qinyu Zhao, Jaskirat Singh, Ming Xu, Akshay Asthana, Stephen Gould, Liang Zheng
Title: DiSA: Diffusion Step Annealing in Autoregressive Image Generation
Abstract:
An increasing number of autoregressive models, such as MAR, FlowAR, xAR, and Harmon, adopt diffusion sampling to improve the quality of image generation. However, this strategy leads to low inference efficiency, because it usually takes 50 to 100 steps for diffusion to sample a token. This paper explores how to effectively address this issue. Our key motivation is that as more tokens are generated during the autoregressive process, subsequent tokens follow more constrained distributions and are easier to sample. To intuitively explain, if a model has generated part of a dog, the remaining tokens must complete the dog and thus are more constrained. Empirical evidence supports our motivation: at later generation stages, the next tokens can be well predicted by a multilayer perceptron, exhibit low variance, and follow closer-to-straight-line denoising paths from noise to tokens. Based on our finding, we introduce diffusion step annealing (DiSA), a training-free method which gradually uses fewer diffusion steps as more tokens are generated, e.g., using 50 steps at the beginning and gradually decreasing to 5 steps at later stages. Because DiSA is derived from our finding specific to diffusion in autoregressive models, it is complementary to existing acceleration methods designed for diffusion alone. DiSA can be implemented in only a few lines of code on existing models, and albeit simple, achieves $5-10\times$ faster inference for MAR and Harmon and $1.4-2.5\times$ for FlowAR and xAR, while maintaining the generation quality.
Chinese: 本文提出扩散步长退火(DiSA)方法,通过随生成令牌增多而逐步减少扩散步数,在保持图像生成质量的同时,将自回归模型的推理速度最高提升10倍。
English: This paper introduces diffusion step annealing (DiSA), a training-free method that accelerates autoregressive image generation by progressively reducing diffusion steps as more tokens are generated, achieving up to 10x faster inference while preserving quality.
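
Because DiSA reportedly takes only a few lines, a plausible schedule is easy to sketch. The endpoints (50 steps decaying to 5) follow the abstract's example; the linear decay itself is an illustrative assumption rather than the paper's exact schedule.

```python
def disa_steps(token_idx: int, total_tokens: int,
               start_steps: int = 50, end_steps: int = 5) -> int:
    """Fewer diffusion steps for later tokens, whose distributions are
    more constrained by the tokens already generated."""
    frac = token_idx / max(total_tokens - 1, 1)
    return round(start_steps + frac * (end_steps - start_steps))

print([disa_steps(i, 256) for i in (0, 64, 128, 255)])  # [50, 39, 27, 5]
```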

Authors:Michael Kirchhof, Luca Füger, Adam Goliński, Eeshan Gunesh Dhekane, Arno Blaas, Sinead Williamson
Title: Self-reflective Uncertainties: Do LLMs Know Their Internal Answer Distribution?
Abstract:
To reveal when a large language model (LLM) is uncertain about a response, uncertainty quantification commonly produces percentage numbers along with the output. But is this all we can do? We argue that in the output space of LLMs, the space of strings, exist strings expressive enough to summarize the distribution over output strings the LLM deems possible. We lay a foundation for this new avenue of uncertainty explication and present SelfReflect, a theoretically-motivated metric to assess how faithfully a string summarizes an LLM's internal answer distribution. We show that SelfReflect is able to discriminate even subtle differences of candidate summary strings and that it aligns with human judgement, outperforming alternative metrics such as LLM judges and embedding comparisons. With SelfReflect, we investigate a number of self-summarization methods and find that even state-of-the-art reasoning models struggle to explicate their internal uncertainty. But we find that faithful summarizations can be generated by sampling and summarizing. To support the development of this universal form of LLM uncertainties, we publish our metric at https://github.com/apple/ml-selfreflect
English Summary: This research introduces SelfReflect, a novel metric for evaluating how accurately a string summarizes a large language model's internal uncertainty distribution over possible outputs, demonstrating its superiority over existing methods and its alignment with human judgment.

Authors:Di Wu, Yixin Wan, Kai-Wei Chang
Title: Visualized Text-to-Image Retrieval
Abstract:
We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval that mitigates the limitations of cross-modal similarity alignment of existing multi-modal embeddings. VisRet first projects textual queries into the image modality via T2I generation. Then, it performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features. Experiments on three knowledge-intensive T2I retrieval benchmarks, including a newly introduced multi-entity benchmark, demonstrate that VisRet consistently improves T2I retrieval by 24.5% to 32.7% NDCG@10 across different embedding models. VisRet also significantly benefits downstream visual question answering accuracy when used in retrieval-augmented generation pipelines. The method is plug-and-play and compatible with off-the-shelf retrievers, making it an effective module for knowledge-intensive multi-modal systems. Our code and the new benchmark are publicly available at https://github.com/xiaowu0162/Visualize-then-Retrieve.
Chinese: VisRet提出了一种新的文本到图像检索范式,通过先将文本查询转换为图像,然后在图像模态内进行检索,显著提升了检索性能24.5%至32.7%,并有效增强了下游任务的准确性。
English: VisRet introduces a novel Text-to-Image retrieval approach by first converting text queries into images and then retrieving within the image modality, significantly improving performance by 24.5% to 32.7% across benchmarks and benefiting downstream tasks.
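
The pipeline reduces to generate-then-retrieve, sketched below with cosine similarity; `generate_image` and `image_embed` are hypothetical stand-ins for any off-the-shelf T2I model and image encoder.

```python
import numpy as np

def visret(query_text, corpus_embs, generate_image, image_embed, k=10):
    query_image = generate_image(query_text)   # project the query into images
    q = image_embed(query_image)               # embed within the image modality
    sims = corpus_embs @ q / (
        np.linalg.norm(corpus_embs, axis=1) * np.linalg.norm(q) + 1e-9
    )
    return np.argsort(-sims)[:k]               # indices of the top-k images
```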

Authors:Weiqi Wu, Xin Guan, Shen Huang, Yong Jiang, Pengjun Xie, Fei Huang, Jiuxin Cao, Hai Zhao, Jingren Zhou
Title: MaskSearch: A Universal Pre-Training Framework to Enhance Agentic Search Capability
Abstract:
Retrieval-Augmented Language Models (RALMs) represent a classic paradigm where models enhance generative capabilities using external knowledge retrieved via a specialized module. Recent advancements in Agent techniques enable Large Language Models (LLMs) to autonomously utilize tools for retrieval, planning, and reasoning. While existing training-based methods show promise, their agentic abilities are limited by inherent characteristics of the task-specific data used during training. To further enhance the universal search capability of agents, we propose a novel pre-training framework, MaskSearch. In the pre-training stage, we introduce the Retrieval Augmented Mask Prediction (RAMP) task, where the model learns to leverage search tools to fill masked spans across a large corpus of pre-training data, thus acquiring universal retrieval and reasoning capabilities for LLMs. After that, the model is trained on downstream tasks to achieve further improvement. We apply both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) for training. For SFT, we combine agent-based and distillation-based methods to generate training data, starting with a multi-agent system consisting of a planner, a rewriter, and an observer, followed by a self-evolving teacher model. For RL, we employ DAPO as the training framework and adopt a hybrid reward system consisting of answer rewards and format rewards. Additionally, we introduce a curriculum learning approach that allows the model to learn progressively from easier to more challenging instances based on the number of masked spans. We evaluate the effectiveness of our framework in the scenario of open-domain multi-hop question answering. Through extensive experiments, we demonstrate that MaskSearch significantly enhances the performance of LLM-based search agents on both in-domain and out-of-domain downstream tasks.
中文: 提出的MaskSearch框架通过名为检索增强掩码预测的新型预训练任务,结合监督微调和强化学习,显著提升了大型语言模型在领域内和跨域任务中的通用搜索能力。
English: The proposed MaskSearch framework enhances large language models' universal search capabilities through a novel pre-training task called Retrieval Augmented Mask Prediction, which combines supervised fine-tuning and reinforcement learning to significantly improve performance on both in-domain and out-of-domain tasks.

Authors:Zitian Gao, Lynx Chen, Haoming Luo, Joey Zhou, Bryan Dai
Title: One-shot Entropy Minimization
Abstract:
We trained 13,440 large language models and found that entropy minimization requires only a single unlabeled example and 10 optimization steps to achieve performance improvements comparable to or even greater than those obtained using thousands of examples and carefully designed rewards in rule-based reinforcement learning. This striking result may prompt a rethinking of post-training paradigms for large language models. Our code is available at https://github.com/zitian-gao/one-shot-em.
中文摘要:通过对13,440个大语言模型的训练发现,仅需一个未标记数据和10步优化,熵最小化方法就能达到甚至超越基于规则的强化学习使用数千数据和精心设计奖励的效果,这一突破性成果可能促使人们重新思考大语言模型的后训练范式。
English Summary: Training 13,440 large language models revealed that entropy minimization with just one unlabeled data point and 10 optimization steps can match or surpass rule-based reinforcement learning using thousands of data points, potentially reshaping post-training approaches for such models.
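
The objective is just the model's own predictive entropy. The toy below optimizes raw logits rather than an LLM's parameters, purely to make the 10-step entropy-minimization loop concrete; in the actual method the gradient flows into the model.

```python
import torch

def token_entropy_loss(logits: torch.Tensor) -> torch.Tensor:
    # Mean per-token entropy of the predicted distribution; minimizing it
    # sharpens the model's own predictions without any labels.
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(-1).mean()

logits = torch.randn(8, 1000, requires_grad=True)  # stand-in for model output
opt = torch.optim.SGD([logits], lr=0.5)
for _ in range(10):                                # ten optimization steps
    loss = token_entropy_loss(logits)
    opt.zero_grad(); loss.backward(); opt.step()
```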

Authors:Haonan Zhang, Run Luo, Xiong Liu, Yuchuan Wu, Ting-En Lin, Pengpeng Zeng, Qiang Qu, Feiteng Fang, Min Yang, Lianli Gao, Jingkuan Song, Fei Huang, Yongbin Li
Title: OmniCharacter: Towards Immersive Role-Playing Agents with Seamless Speech-Language Personality Interaction
Abstract:
Role-Playing Agents (RPAs), benefiting from large language models, are an emerging class of interactive AI systems that simulate roles or characters with diverse personalities. However, existing methods primarily focus on mimicking dialogues among roles in textual form, neglecting the roles' voice traits (e.g., voice style and emotions), which play a crucial role in interaction and make for more immersive experiences in realistic scenarios. Towards this goal, we propose OmniCharacter, the first seamless speech-language personality interaction model to achieve immersive RPAs with low latency. Specifically, OmniCharacter enables agents to consistently exhibit role-specific personality traits and vocal traits throughout the interaction, enabling a mixture of speech and language responses. To align the model with speech-language scenarios, we construct a dataset named OmniCharacter-10K, which involves distinctive characters (20), richly contextualized multi-round dialogues (10K), and dynamic speech responses (135K). Experimental results showcase that our method yields better responses in terms of both content and style compared to existing RPAs and mainstream speech-language models, with a response latency as low as 289ms. Code and dataset are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/OmniCharacter.
中文:OmniCharacter是一种首创的语音-语言个性交互模型,通过低延迟使角色扮演智能体持续展现角色特定特质和声音特征,借助丰富数据集创造沉浸式体验,在内容和风格上均优于现有方法。
English: OmniCharacter is a pioneering speech-language personality interaction model that enables Role-Playing Agents to consistently exhibit role-specific traits and vocal characteristics with low latency, creating immersive experiences through a comprehensive dataset and outperforming existing methods in both content and style.

Authors:Haoyu Wang, Zeyu Qin, Yifei Zhao, Chao Du, Min Lin, Xueqian Wang, Tianyu Pang
Title: Lifelong Safety Alignment for Language Models
Abstract:
LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments. The code is available at https://github.com/sail-sg/LifelongSafetyAlignment.
中文: 本文提出了一种终身安全对齐框架,通过元攻击器与防御器的竞争机制,使大语言模型持续适应新型越狱攻击,初始攻击成功率高达73%,经迭代训练后降至7%,从而提升开放环境下的部署安全性。
English: This paper introduces a lifelong safety alignment framework that uses a competitive Meta-Attacker and Defender to continuously adapt LLMs to evolving jailbreaking attacks, achieving a 73% attack success rate initially and reducing it to 7% after iterative training for safer deployment.

Authors:Hao Kang, Zichun Yu, Chenyan Xiong
Title: FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models
Abstract:
Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture (64 experts with top-8 gating and 2 shared experts) closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
中文:FLAME-MoE是一个完全开源的研究平台,提供采用专家混合架构的仅解码器模型,在比密集基线准确率提升最高3.4个百分点的同时,实现了对专家专业化和路由行为的透明化分析。
English: FLAME-MoE is a fully open-source research suite that provides decoder-only models with Mixture-of-Experts architecture, improving accuracy by up to 3.4 points over dense baselines while enabling transparent analysis of expert specialization and routing behavior.
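
The routing pattern the abstract specifies (64 experts, top-8 gating, 2 always-on shared experts) can be sketched in a few lines; the hidden size and the per-token loop below are toy simplifications for clarity, not the released training code.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d=32, n_experts=64, top_k=8, n_shared=2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.shared = nn.ModuleList(nn.Linear(d, d) for _ in range(n_shared))
        self.gate = nn.Linear(d, n_experts)
        self.top_k = top_k

    def forward(self, x):                         # x: (tokens, d)
        w, idx = self.gate(x).topk(self.top_k, dim=-1)
        w = torch.softmax(w, dim=-1)              # renormalize top-8 gate scores
        out = sum(s(x) for s in self.shared)      # shared experts see all tokens
        routed = [
            sum(w[t, j] * self.experts[idx[t, j]](x[t])
                for j in range(self.top_k))
            for t in range(x.size(0))             # route each token to its top-8
        ]
        return out + torch.stack(routed)

print(TinyMoE()(torch.randn(4, 32)).shape)  # torch.Size([4, 32])
```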

Authors:Pengxiang Li, Shilin Yan, Joey Tsai, Renrui Zhang, Ruichuan An, Ziyu Guo, Xiaowei Gao
Title: Adaptive Classifier-Free Guidance via Dynamic Low-Confidence Masking
Abstract:
Classifier-Free Guidance (CFG) significantly enhances controllability in generative models by interpolating conditional and unconditional predictions. However, standard CFG often employs a static unconditional input, which can be suboptimal for iterative generation processes where model uncertainty varies dynamically. We introduce Adaptive Classifier-Free Guidance (A-CFG), a novel method that tailors the unconditional input by leveraging the model's instantaneous predictive confidence. At each step of an iterative (masked) diffusion language model, A-CFG identifies tokens in the currently generated sequence for which the model exhibits low confidence. These tokens are temporarily re-masked to create a dynamic, localized unconditional input. This focuses CFG's corrective influence precisely on areas of ambiguity, leading to more effective guidance. We integrate A-CFG into a state-of-the-art masked diffusion language model and demonstrate its efficacy. Experiments on diverse language generation benchmarks show that A-CFG yields substantial improvements over standard CFG, achieving, for instance, a 3.9 point gain on GPQA. Our work highlights the benefit of dynamically adapting guidance mechanisms to model uncertainty in iterative generation.
Chinese: 自适应无分类器引导(A-CFG)通过基于模型实时预测置信度动态调整无条件输入,在迭代生成过程中聚焦于不确定标记进行校正引导,从而显著提升生成性能,如在GPQA上实现了3.9分的提升。
English: Adaptive Classifier-Free Guidance (A-CFG) improves generative model controllability by dynamically adjusting unconditional inputs based on the model's real-time predictive confidence, focusing corrective guidance on uncertain tokens during iterative generation and achieving significant performance gains, such as a 3.9-point increase on GPQA.
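
The mechanism can be sketched as re-mask-then-guide. In the toy below, `MASK_ID`, the re-masking fraction, and the stand-in model are illustrative assumptions; only the low-confidence re-masking and the standard CFG interpolation come from the abstract.

```python
import torch

MASK_ID = 0

def acfg_step(cond_logits, tokens, confidences, model, guidance=1.5, frac=0.25):
    # Re-mask the currently least-confident positions to form a dynamic,
    # localized unconditional input.
    k = max(1, int(frac * tokens.numel()))
    low_conf = confidences.topk(k, largest=False).indices
    uncond_tokens = tokens.clone()
    uncond_tokens[low_conf] = MASK_ID
    uncond_logits = model(uncond_tokens)
    # Standard CFG interpolation between conditional and unconditional logits.
    return uncond_logits + guidance * (cond_logits - uncond_logits)

toy_model = lambda toks: torch.randn(toks.numel(), 50)  # stand-in denoiser
out = acfg_step(torch.randn(6, 50), torch.randint(1, 50, (6,)),
                torch.rand(6), toy_model)
print(out.shape)  # torch.Size([6, 50])
```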

Authors:Chun-Yi Kuan, Hung-yi Lee
Title: From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data
Abstract:
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. However, this adaptation process presents two major limitations. First, ALLMs often suffer from catastrophic forgetting, where crucial textual capabilities like instruction-following are lost after training on audio data. In some cases, models may even hallucinate sounds that are not present in the input audio, raising concerns about reliability. Second, achieving cross-modal alignment between audio and language typically relies on large collections of task-specific question-answer pairs for instruction tuning, making it resource-intensive. To address these issues, previous works have leveraged the backbone LLMs to synthesize general-purpose, caption-style alignment data. In this paper, we propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds. We further extend our approach to multi-audio scenarios, enabling the model to either explain differences between audio inputs or produce unified captions that describe all inputs, thereby enhancing audio-language alignment. We refer to the entire ALLM training framework as bootstrapping audio-language alignment via synthetic data generation from backbone LLMs (BALSa). Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance on audio understanding and reasoning benchmarks, as well as instruction-following skills. Moreover, incorporating multi-audio training further enhances the model's comprehension and reasoning capabilities. Overall, BALSa offers an efficient and scalable approach to developing ALLMs.

Authors:Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao, Yunqi Qiu, Huawei Shen, Xueqi Cheng
Title: Inference-time Alignment in Continuous Space
Abstract:
Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility. Existing methods rely on generating multiple responses from the base policy for search using a reward model, which can be considered as searching in a discrete response space. However, these methods struggle to explore informative candidates when the base policy is weak or the candidate set is small, resulting in limited effectiveness. In this paper, to address this problem, we propose Simple Energy Adaptation (SEA), a simple yet effective algorithm for inference-time alignment. In contrast to expensive search over the discrete space, SEA directly adapts original responses from the base policy toward the optimal one via gradient-based sampling in continuous latent space. Specifically, SEA formulates inference as an iterative optimization procedure on an energy function over actions in the continuous space defined by the optimal policy, enabling simple and effective alignment. For instance, despite its simplicity, SEA outperforms the second-best baseline with a relative improvement of up to 77.51% on AdvBench and 16.36% on MATH. Our code is publicly available at https://github.com/yuanyige/sea
Chinese: 本文提出简单能量适应(SEA)方法,通过在连续潜空间中进行梯度采样来优化基础策略的原始响应,实现了推理时的高效对齐,在AdvBench和MATH基准测试中显著优于现有基线方法。
English: This paper introduces Simple Energy Adaptation (SEA), a gradient-based sampling method that aligns large language models during inference by optimizing responses in continuous latent space, significantly outperforming existing baselines on benchmarks like AdvBench and MATH.
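
The essence is gradient descent on an energy over a continuous latent rather than search over discrete candidates. The quadratic energy below is a placeholder for the paper's reward-defined energy; everything else is generic optimization machinery.

```python
import torch

def sea_adapt(z0: torch.Tensor, energy, steps: int = 50, lr: float = 0.1):
    """Iteratively move a continuous latent toward low energy."""
    z = z0.clone().requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        e = energy(z)
        opt.zero_grad(); e.backward(); opt.step()
    return z.detach()

target = torch.ones(8)
energy = lambda z: ((z - target) ** 2).sum()  # illustrative energy function
print(sea_adapt(torch.zeros(8), energy))      # latent adapted toward the target
```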

Authors:Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao, Bolin Ding, Bingbing Xu
Title: Incentivizing Strong Reasoning from Weak Supervision
Abstract:
Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at https://github.com/yuanyige/w2sr.
中文摘要:本研究证明,利用显著弱化模型的监督可有效提升大型语言模型的推理能力,以远低于强化学习的成本实现了后者94%的性能增益。
English summary: This study demonstrates that using supervision from significantly weaker models can effectively enhance the reasoning capabilities of large language models, achieving nearly 94% of the performance gains of expensive reinforcement learning methods at a much lower cost.

Authors:Alkis Koudounas, Moreno La Quatra, Elena Baralis
Title: DeepDialogue: A Multi-Turn Emotionally-Rich Spoken Dialogue Dataset
Abstract:
Recent advances in conversational AI have demonstrated impressive capabilities in single-turn responses, yet multi-turn dialogues remain challenging for even the most sophisticated language models. Current dialogue datasets are limited in their emotional range, domain diversity, turn depth, and are predominantly text-only, hindering progress in developing more human-like conversational systems across modalities. To address these limitations, we present DeepDialogue, a large-scale multimodal dataset containing 40,150 high-quality multi-turn dialogues spanning 41 domains and incorporating 20 distinct emotions with coherent emotional progressions. Our approach pairs 9 different language models (4B-72B parameters) to generate 65,600 initial conversations, which we then evaluate through a combination of human annotation and LLM-based quality filtering. The resulting dataset reveals fundamental insights: smaller models fail to maintain coherence beyond 6 dialogue turns; concrete domains (e.g., "cars," "travel") yield more meaningful conversations than abstract ones (e.g., "philosophy"); and cross-model interactions produce more coherent dialogues than same-model conversations. A key contribution of DeepDialogue is its speech component, where we synthesize emotion-consistent voices for all 40,150 dialogues, creating the first large-scale open-source multimodal dialogue dataset that faithfully preserves emotional context across multi-turn conversations.

Authors:Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
Title: MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models
Abstract:
Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, such as testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench achieves an average evaluation cost reduced to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See https://github.com/MilkThink-Lab/MiniLongBench for our code, data and tutorial.
Chinese: 本文提出MiniLongBench,通过压缩LongBench基准,将评估成本降至原版的4.5%,同时保持0.97的排名相关性,为大型语言模型的长文本理解研究提供了高效评估方案。
English: This paper introduces MiniLongBench, a compressed version of the LongBench benchmark that reduces evaluation costs to just 4.5% of the original while maintaining a 0.97 rank correlation, enabling more efficient long context understanding research in large language models.

Authors:Jihyung Lee, Jin-Seop Lee, Jaehoon Lee, YunSeok Choi, Jee-Hyong Lee
Title: DCG-SQL: Enhancing In-Context Learning for Text-to-SQL with Deep Contextual Schema Link Graph
Abstract:
Text-to-SQL, which translates a natural language question into an SQL query, has advanced with in-context learning of Large Language Models (LLMs). However, existing methods show little improvement in performance compared to randomly chosen demonstrations, and significant performance drops when smaller LLMs (e.g., Llama 3.1-8B) are used. This indicates that these methods heavily rely on the intrinsic capabilities of hyper-scaled LLMs, rather than effectively retrieving useful demonstrations. In this paper, we propose a novel approach for effectively retrieving demonstrations and generating SQL queries. We construct a Deep Contextual Schema Link Graph, which contains key information and semantic relationships between a question and its database schema items. This graph-based structure enables effective representation of Text-to-SQL samples and retrieval of useful demonstrations for in-context learning. Experimental results on the Spider benchmark demonstrate the effectiveness of our approach, showing consistent improvements in SQL generation performance and efficiency across both hyper-scaled LLMs and small LLMs. The code is available at https://github.com/jjklle/DCG-SQL.
Chinese: 本文提出了一种新方法,通过构建深度上下文模式链接图来有效检索演示并生成SQL查询,在Spider基准测试中证明该方法能提升超大规模和小型大语言模型的SQL生成性能与效率。
English: This paper introduces a novel approach that constructs a Deep Contextual Schema Link Graph to effectively retrieve demonstrations and generate SQL queries, improving performance and efficiency across both hyper-scaled and small LLMs as demonstrated on the Spider benchmark.

Authors:Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, Jianing Wang, Qintong Li, Xiangru Tang, Tianbao Xie, Xiachong Feng, Xiang Li, Ben Kao, Wenhai Wang, Biqing Qi, Lingpeng Kong, Zhiyong Wu
Title: ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
Abstract:
Large Language Models (LLMs) have extended their impact beyond Natural Language Processing, substantially fostering the development of interdisciplinary research. Recently, various LLM-based agents have been developed to assist scientific discovery progress across multiple aspects and domains. Among these, computer-using agents, capable of interacting with operating systems as humans do, are paving the way to automated scientific problem-solving and addressing routines in researchers' workflows. Recognizing the transformative potential of these agents, we introduce ScienceBoard, which encompasses two complementary contributions: (i) a realistic, multi-domain environment featuring dynamic and visually rich scientific workflows with integrated professional software, where agents can autonomously interact via different interfaces to accelerate complex research tasks and experiments; and (ii) a challenging benchmark of 169 high-quality, rigorously validated real-world tasks curated by humans, spanning scientific-discovery workflows in domains such as biochemistry, astronomy, and geoinformatics. Extensive evaluations of agents with state-of-the-art backbones (e.g., GPT-4o, Claude 3.7, UI-TARS) show that, despite some promising results, they still fall short of reliably assisting scientists in complex workflows, achieving only a 15% overall success rate. In-depth analysis further provides valuable insights for addressing current agent limitations and more effective design principles, paving the way to build more capable agents for scientific discovery. Our code, environment, and benchmark are at https://qiushisun.github.io/ScienceBoard-Home/.

Authors:Alejandro Carrasco, Victor Rodriguez-Fernandez, Richard Linares
Title: Large Language Models as Autonomous Spacecraft Operators in Kerbal Space Program
Abstract:
Recent trends are emerging in the use of Large Language Models (LLMs) as autonomous agents that take actions based on the content of the user text prompts. We intend to apply these concepts to the field of Control in space, enabling LLMs to play a significant role in the decision-making process for autonomous satellite operations. As a first step towards this goal, we have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge, a public software design competition where participants create autonomous agents for maneuvering satellites involved in non-cooperative space operations, running on the KSP game engine. Our approach leverages prompt engineering, few-shot prompting, and fine-tuning techniques to create an effective LLM-based agent that ranked 2nd in the competition. To the best of our knowledge, this work pioneers the integration of LLM agents into space research. The project comprises several open repositories to facilitate replication and further research. The codebase is accessible on GitHub (https://github.com/ARCLab-MIT/kspdg), while the trained models and datasets are available on Hugging Face (https://huggingface.co/OhhTuRnz). Additionally, experiment tracking and detailed results can be reviewed on Weights & Biases (https://wandb.ai/carrusk/huggingface).
中文: 本研究开创性地将大型语言模型作为自主代理应用于空间控制领域,通过提示工程和微调技术在卫星机动竞赛中获得第二名。
English: This research pioneers the use of Large Language Models as autonomous agents for space control, achieving second place in a satellite maneuvering competition through prompt engineering and fine-tuning techniques.

Authors:Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang
Title: REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Large Reasoning Models
Abstract:
Large Reasoning Models (LRMs) demonstrate strong performance in complex tasks but often face the challenge of overthinking, leading to substantially high inference costs. Existing approaches synthesize shorter reasoning responses for LRMs to learn from, but are inefficient for online usage due to the time-consuming data generation and filtering processes. Meanwhile, online reinforcement learning mainly adopts a length reward to encourage short reasoning responses, but tends to lose reflection ability and harm performance. To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision. Besides, a reflection reward is designed to further prevent LRMs from favoring short yet non-reflective responses. Experiments show that both methods maintain or enhance performance while significantly improving inference efficiency. Their combination achieves a good balance between performance and efficiency, reducing inference costs by 35% without compromising performance. Further analysis demonstrates that our methods are effective by maintaining reflection frequency for hard problems while appropriately reducing it for simpler ones without losing reflection ability. Codes are available at https://github.com/hexuandeng/REA-RL.
中文摘要:提出的REA-RL方法通过引入反思模型进行高效在线训练并设计反思奖励,解决了大型推理模型的过度思考问题,在保持性能的同时显著提升35%的推理效率。
English Summary: The proposed REA-RL method addresses overthinking in Large Reasoning Models by combining a reflection model for efficient online training with a reflection reward, achieving 35% inference cost reduction while maintaining performance.

Authors:Sirui Chen, Shuqin Ma, Shu Yu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Title: Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks
Abstract:
Consciousness stands as one of the most profound and distinguishing features of the human mind, fundamentally shaping our understanding of existence and agency. As large language models (LLMs) develop at an unprecedented pace, questions concerning intelligence and consciousness have become increasingly significant. However, discourse on LLM consciousness remains largely unexplored territory. In this paper, we first clarify frequently conflated terminologies (e.g., LLM consciousness and LLM awareness). Then, we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce. Finally, we discuss current challenges and outline future directions in this emerging field. The references discussed in this paper are organized at https://github.com/OpenCausaLab/Awesome-LLM-Consciousness.
中文: 本文系统探讨了大型语言模型意识这一新兴领域,厘清了相关术语,整合了现有研究,并指出了潜在风险及未来研究方向。
English: This paper systematically explores the largely uncharted territory of LLM consciousness, clarifying terminology, synthesizing research, and addressing potential risks and future directions in the field.

Authors:Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem
Title: MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs
Abstract:
Metadata extraction is essential for cataloging and preserving datasets, enabling effective research discovery and reproducibility, especially given the current exponential growth in scientific research. While Masader (Alyafeai et al., 2021) laid the groundwork for extracting a wide range of metadata attributes from Arabic NLP datasets' scholarly articles, it relies heavily on manual annotation. In this paper, we present MOLE, a framework that leverages Large Language Models (LLMs) to automatically extract metadata attributes from scientific papers covering datasets of languages other than Arabic. Our schema-driven methodology processes entire documents across multiple input formats and incorporates robust validation mechanisms for consistent output. Additionally, we introduce a new benchmark to evaluate the research progress on this task. Through systematic analysis of context length, few-shot learning, and web browsing integration, we demonstrate that modern LLMs show promising results in automating this task, highlighting the need for further improvements to ensure consistent and reliable performance. We release the code: https://github.com/IVUL-KAUST/MOLE and dataset: https://huggingface.co/datasets/IVUL-KAUST/MOLE for the research community.
中文摘要:MOLE框架利用大语言模型自动从非阿拉伯语数据集的科学论文中提取元数据,采用模式驱动处理和验证机制,并建立了新的评估基准。
English Summary: MOLE is a framework that uses Large Language Models to automatically extract metadata from scientific papers for non-Arabic datasets, employing schema-driven processing and validation while introducing a new evaluation benchmark.

Authors:Ruihan Gong, Yue Liu, Wenjie Qu, Mingzhe Du, Yufei He, Yingwei Ma, Yulin Chen, Xiang Liu, Yi Wen, Xinfeng Li, Ruidong Wang, Xinzhong Zhu, Bryan Hooi, Jiaheng Zhang
Title: Efficient Reasoning via Chain of Unconscious Thought
Abstract:
Large Reasoning Models (LRMs) achieve promising performance but compromise token efficiency due to verbose reasoning processes. Unconscious Thought Theory (UTT) posits that complex problems can be solved more efficiently through internalized cognitive processes. Inspired by UTT, we propose a new reasoning paradigm, termed Chain of Unconscious Thought (CoUT), to improve the token efficiency of LRMs by guiding them to mimic human unconscious thought and internalize reasoning processes. Concretely, we first prompt the model to internalize the reasoning by thinking in the hidden layer. Then, we design a bag of token-efficient strategies to further help models reduce unnecessary tokens yet preserve the performance. Our work reveals that models may possess beneficial unconscious thought, enabling improved efficiency without sacrificing performance. Extensive experiments demonstrate the effectiveness of CoUT. Remarkably, it surpasses CoT by reducing token usage by 47.62% while maintaining comparable accuracy, as shown in Figure 1. The code of CoUT is available at this link: https://github.com/Rohan-GRH/CoUT
中文摘要:无意识思维链(CoUT)通过内化推理过程,显著提升大型推理模型的令牌效率,在保持与思维链方法相近准确率的同时,将令牌使用量减少47.62%。
English Summary: The Chain of Unconscious Thought (CoUT) paradigm enhances token efficiency in Large Reasoning Models by internalizing reasoning processes, reducing token usage by 47.62% while maintaining accuracy comparable to Chain of Thought methods.
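The core of CoUT is a prompting change rather than new training, so a minimal sketch is just the instruction that asks the model to internalize its reasoning. The wording below is illustrative, not the paper's exact template.

```python
# Minimal sketch of a CoUT-style prompt (wording is illustrative, not the
# paper's exact template): instruct the model to reason internally and
# emit only a compact final answer.
def build_cout_prompt(question: str) -> str:
    return (
        "Solve the following problem. Think through the steps internally, "
        "without writing them out. Output only the final answer.\n\n"
        f"Problem: {question}\nAnswer:"
    )

print(build_cout_prompt("What is 17 * 23?"))
```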

Authors:Ruisheng Cao, Hanchong Zhang, Tiancheng Huang, Zhangyi Kang, Yuxin Zhang, Liangtai Sun, Hanqi Li, Yuxun Miao, Shuai Fan, Lu Chen, Kai Yu
Title: NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering
Abstract:
The increasing number of academic papers poses significant challenges for researchers to efficiently acquire key details. While retrieval augmented generation (RAG) shows great promise in large language model (LLM) based automated question answering, previous works often isolate neural and symbolic retrieval despite their complementary strengths. Moreover, conventional single-view chunking neglects the rich structure and layout of PDFs, e.g., sections and tables. In this work, we propose NeuSym-RAG, a hybrid neural symbolic retrieval framework which combines both paradigms in an interactive process. By leveraging multi-view chunking and schema-based parsing, NeuSym-RAG organizes semi-structured PDF content into both a relational database and a vectorstore, enabling LLM agents to iteratively gather context until it is sufficient to generate answers. Experiments on three full PDF-based QA datasets, including a self-annotated one, AIRQA-REAL, show that NeuSym-RAG consistently outperforms both vector-based RAG and various structured baselines, highlighting its capacity to unify both retrieval schemes and utilize multiple views. Code and data are publicly available at https://github.com/X-LANCE/NeuSym-RAG.
Chinese: NeuSym-RAG 是一种混合神经符号检索框架,通过多视图分块和基于模式的解析将两种检索范式结合,使LLM代理能从PDF中迭代收集上下文,并在多个问答数据集上稳定超越现有方法。
English: NeuSym-RAG is a hybrid neural-symbolic retrieval framework that integrates both retrieval paradigms through multi-view chunking and schema-based parsing, enabling LLM agents to iteratively gather context from PDFs and outperforming existing methods on multiple QA datasets.
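To make the hybrid idea concrete, here is a toy sketch in which the same PDF chunks live both in a relational table (for symbolic, schema-based queries) and in a vector store (for neural similarity search); an agent would interleave the two actions until it has enough context. The schema and the random stand-in embeddings are assumptions.

```python
import sqlite3
import numpy as np

# Toy hybrid retrieval: chunks stored relationally and as vectors.
chunks = [("intro", "We study PDF question answering."),
          ("table", "Table 2 reports accuracy per dataset."),
          ("method", "Our retriever mixes SQL and vectors.")]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE chunk(section TEXT, text TEXT)")
db.executemany("INSERT INTO chunk VALUES (?, ?)", chunks)

rng = np.random.default_rng(0)
vectors = {text: rng.normal(size=32) for _, text in chunks}  # stand-in encoder

def symbolic_query(section: str) -> list[str]:
    rows = db.execute("SELECT text FROM chunk WHERE section = ?", (section,))
    return [r[0] for r in rows]

def neural_query(query_vec: np.ndarray, k: int = 1) -> list[str]:
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(vectors, key=lambda t: cos(query_vec, vectors[t]), reverse=True)
    return ranked[:k]

# An agent would interleave both retrieval actions across turns.
context = symbolic_query("table") + neural_query(rng.normal(size=32))
```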

Authors:Yang Zhang, Yu Yu, Bo Tang, Yu Zhu, Chuxiong Sun, Wenqiang Wei, Jie Hu, Zipeng Xie, Zhiyu Li, Feiyu Xiong, Edward Chung
Title: Token-level Accept or Reject: A Micro Alignment Approach for Large Language Models
Abstract:
With the rapid development of Large Language Models (LLMs), aligning these models with human preferences and values is critical to ensuring ethical and safe applications. However, existing alignment techniques such as RLHF or DPO often require direct fine-tuning on LLMs with billions of parameters, resulting in substantial computational costs and inefficiencies. To address this, we propose the Micro token-level Accept-Reject Aligning (MARA) approach, designed to operate independently of the language models. MARA simplifies the alignment process by decomposing sentence-level preference learning into token-level binary classification, where a compact three-layer fully-connected network determines whether candidate tokens are "Accepted" or "Rejected" as part of the response. Extensive experiments across seven different LLMs and three open-source datasets show that MARA achieves significant improvements in alignment performance while reducing computational costs. The source code and implementation details are publicly available at https://github.com/IAAR-Shanghai/MARA, and the trained models are released at https://huggingface.co/IAAR-Shanghai/MARA_AGENTS.
中文: 提出的微令牌级接受-拒绝对齐(MARA)方法通过将句子级偏好学习转化为令牌级二元分类,在多个模型和数据集上显著提升了对齐性能并降低了计算成本。
English: The proposed Micro token-level Accept-Reject Aligning (MARA) method efficiently aligns large language models with human preferences by converting sentence-level learning into token-level classification, significantly improving performance while reducing computational costs across multiple models and datasets.
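The abstract specifies the classifier concretely: a compact three-layer fully-connected network making a binary accept/reject decision per candidate token. A minimal sketch of such a head, with feature construction and dimensions as assumptions:

```python
import torch
import torch.nn as nn

# Minimal sketch of MARA's token-level accept/reject idea: a compact
# three-layer fully-connected network scores each candidate token's
# feature as "accept" (1) or "reject" (0). Feature construction and
# training details are assumptions, not the released implementation.
class AcceptRejectHead(nn.Module):
    def __init__(self, feat_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # logits over {reject, accept}
        )

    def forward(self, token_feats: torch.Tensor) -> torch.Tensor:
        return self.net(token_feats)

head = AcceptRejectHead()
feats = torch.randn(5, 768)            # features for 5 candidate tokens
accept = head(feats).argmax(dim=-1)    # 1 = keep candidate, 0 = resample
```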

Authors:Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, Soujanya Poria
Title: Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision
Abstract:
Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs) score each intermediate step to steer generation toward coherent solutions. We introduce PathFinder-PRM, a novel hierarchical, error-aware discriminative PRM that first classifies math and consistency errors at each step, then combines these fine-grained signals to estimate step correctness. To train PathFinder-PRM, we construct a 400K-sample dataset by enriching the human-annotated PRM800K corpus and RLHFlow Mistral traces with three-dimensional step-level labels. On PRMBench, PathFinder-PRM achieves a new state-of-the-art PRMScore of 67.7, outperforming the prior best (65.5) while using 3 times less data. When applied to reward guided greedy search, our model yields prm@8 48.3, a +1.5 point gain over the strongest baseline. These results demonstrate that decoupled error detection and reward estimation not only boost fine-grained error detection but also substantially improve end-to-end, reward-guided mathematical reasoning with greater data efficiency.
中文: PathFinder-PRM作为一种分层过程奖励模型,通过检测步骤级错误并整合信号进行奖励估计,显著提升了数学推理的精确性和数据效率,实现了最优性能。
English: PathFinder-PRM is a hierarchical process reward model that enhances mathematical reasoning by detecting step-level errors and combining them for reward estimation, achieving state-of-the-art performance with improved data efficiency.
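A small sketch of the decoupled design the abstract describes: predict fine-grained error signals first (math and consistency), then combine them with the step representation to estimate correctness. Dimensions and the combination rule are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of error-aware hierarchical scoring: fine-grained error heads
# feed a step-correctness head. The combination rule is an assumption.
class HierarchicalPRM(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.math_err = nn.Linear(dim, 1)
        self.consistency_err = nn.Linear(dim, 1)
        self.correctness = nn.Linear(dim + 2, 1)

    def forward(self, step_repr: torch.Tensor) -> torch.Tensor:
        m = torch.sigmoid(self.math_err(step_repr))          # P(math error)
        c = torch.sigmoid(self.consistency_err(step_repr))   # P(consistency error)
        x = torch.cat([step_repr, m, c], dim=-1)
        return torch.sigmoid(self.correctness(x))            # P(step is correct)

prm = HierarchicalPRM()
p_correct = prm(torch.randn(4, 1024))  # one score per reasoning step
```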

Authors:Tingjia Shen, Hao Wang, Chuan Qin, Ruijun Sun, Yang Song, Defu Lian, Hengshu Zhu, Enhong Chen
Title: GenKI: Enhancing Open-Domain Question Answering with Knowledge Integration and Controllable Generation in Large Language Models
Abstract:
Open-domain question answering (OpenQA) represents a cornerstone in natural language processing (NLP), primarily focused on extracting answers from unstructured textual data. With the rapid advancements in Large Language Models (LLMs), LLM-based OpenQA methods have reaped the benefits of emergent understanding and answering capabilities enabled by massive parameters compared to traditional methods. However, most of these methods encounter two critical challenges: how to integrate knowledge into LLMs effectively and how to adaptively generate results with specific answer formats for various task situations. To address these challenges, we propose a novel framework named GenKI, which aims to improve OpenQA performance by exploring Knowledge Integration and controllable Generation on LLMs simultaneously. Specifically, we first train a dense passage retrieval model to retrieve associated knowledge from a given knowledge base. Subsequently, we introduce a novel knowledge integration model that incorporates the retrieved knowledge into instructions during fine-tuning to strengthen the model. Furthermore, to enable controllable generation in LLMs, we leverage a fine-tuned LLM together with a text-consistency-based ensemble that accounts for coherence, fluency, and answer-format assurance. Finally, extensive experiments conducted on the TriviaQA, MSMARCO, and CMRC2018 datasets, featuring diverse answer formats, demonstrate the effectiveness of GenKI in comparison with state-of-the-art baselines. Moreover, ablation studies have disclosed a linear relationship between the frequency of retrieved knowledge and the model's ability to recall knowledge accurately against the ground truth. Our code of GenKI is available at https://github.com/USTC-StarTeam/GenKI
中文摘要:GenKI框架通过将知识库中的知识整合到大型语言模型中并实现可控生成,有效提升了开放域问答性能,在多个数据集上表现优异,并揭示了知识检索频率与准确回忆能力之间的线性关系。
English Summary: The GenKI framework enhances OpenQA by integrating knowledge from a knowledge base into LLMs and enabling controllable generation, demonstrating superior performance across diverse datasets and revealing a linear relationship between knowledge frequency and recall accuracy.

Authors:Xiaochuan Liu, Ruihua Song, Xiting Wang, Xu Chen
Title: Select, Read, and Write: A Multi-Agent Framework of Full-Text-based Related Work Generation
Abstract:
Automatic related work generation (RWG) can save people's time and effort when writing a draft of the related work section (RWS) for further revision. However, existing methods for RWG often suffer from shallow comprehension, because they take only limited portions of the reference papers as input, and from isolated explanations of each reference, because they fail to effectively capture the relationships among references. To address these issues, we focus on the full-text-based RWG task and propose a novel multi-agent framework. Our framework consists of three agents: a selector that decides which section of the papers to read next, a reader that digests the selected section and updates a shared working memory, and a writer that generates the RWS based on the final curated memory. To better capture the relationships among references, we also propose two graph-aware strategies for the selector, enabling it to optimize the reading order under the constraints of the graph structure. Extensive experiments demonstrate that our framework consistently improves performance across three base models and various input configurations. The graph-aware selectors outperform alternative selectors, achieving state-of-the-art results. The code and data are available at https://github.com/1190200817/Full_Text_RWG.
中文: 本研究提出了一种基于全文的多智能体框架,通过图感知策略优化参考文献间的关联理解,在自动生成相关工作部分中实现了最优性能。
English: This study introduces a multi-agent framework for automatic related work generation that utilizes full-text analysis and graph-aware strategies to enhance comprehension and relationship mapping among references, achieving state-of-the-art performance.

Authors:Junteng Liu, Yuanxiang Fan, Zhuo Jiang, Han Ding, Yongyi Hu, Chi Zhang, Yiqi Shi, Shitong Weng, Aili Chen, Shiqi Chen, Yunan Huang, Mozhi Zhang, Pengyu Zhao, Junjie Yan, Junxian He
Title: SynLogic: Synthesizing Verifiable Reasoning Data at Scale for Learning Logical Reasoning and Beyond
Abstract:
Recent advances such as OpenAI-o1 and DeepSeek R1 have demonstrated the potential of Reinforcement Learning (RL) to enhance reasoning abilities in Large Language Models (LLMs). While open-source replication efforts have primarily focused on mathematical and coding domains, methods and resources for developing general reasoning capabilities remain underexplored. This gap is partly due to the challenge of collecting diverse and verifiable reasoning data suitable for RL. We hypothesize that logical reasoning is critical for developing general reasoning capabilities, as logic forms a fundamental building block of reasoning. In this work, we present SynLogic, a data synthesis framework and dataset that generates diverse logical reasoning data at scale, encompassing 35 diverse logical reasoning tasks. The SynLogic approach enables controlled synthesis of data with adjustable difficulty and quantity. Importantly, all examples can be verified by simple rules, making them ideally suited for RL with verifiable rewards. In our experiments, we validate the effectiveness of RL training on the SynLogic dataset based on 7B and 32B models. SynLogic leads to state-of-the-art logical reasoning performance among open-source datasets, surpassing DeepSeek-R1-Distill-Qwen-32B by 6 points on BBEH. Furthermore, mixing SynLogic data with mathematical and coding tasks improves the training efficiency of these domains and significantly enhances reasoning generalization. Notably, our mixed training model outperforms DeepSeek-R1-Zero-Qwen-32B across multiple benchmarks. These findings position SynLogic as a valuable resource for advancing the broader reasoning capabilities of LLMs. We open-source both the data synthesis pipeline and the SynLogic dataset at https://github.com/MiniMax-AI/SynLogic.
中文: SynLogic框架生成可扩展且可验证的逻辑推理数据,通过强化学习训练提升大语言模型的推理能力,在结合数学和编程任务时不仅实现了最优性能,还显著增强了推理的泛化能力。
English: The SynLogic framework generates scalable, verifiable logical reasoning data that enhances LLMs' reasoning through RL training, achieving state-of-the-art performance and improving generalization when combined with mathematical and coding tasks.
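The key property of the data is that correctness is checkable by simple rules. A toy generator-plus-verifier pair in that spirit (a single ordering puzzle, far simpler than SynLogic's 35 task types):

```python
import random

# Toy example of "verifiable by simple rules": synthesize an ordering
# puzzle with a known ground truth and check answers programmatically.
def make_ordering_task(n: int = 4, seed: int = 0):
    rng = random.Random(seed)
    names = [f"P{i}" for i in range(n)]
    order = names[:]               # becomes the ground-truth ranking
    rng.shuffle(order)             # order[0] is the tallest
    clues = [f"{order[i]} is taller than {order[i + 1]}"
             for i in range(n - 1)]
    question = "; ".join(clues) + ". Who is tallest?"
    return question, order[0]

def verify(answer: str, gold: str) -> bool:
    return answer.strip() == gold  # rule-based reward signal for RL

q, gold = make_ordering_task()
assert verify(gold, gold)
```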

Authors:Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
Title: Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models
Abstract:
Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs) and evaluate the semantic understanding capabilities of LLMs based on word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words. There is a trend suggesting that models with more parameters tend to perform better on multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA ($\textbf{L}$arge $\textbf{L}$anguage Model-Inspired $\textbf{A}$ho-$\textbf{C}$orasick $\textbf{A}$utomaton). Leveraging the advanced pattern recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic $n$-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at https://github.com/hkr04/LLACA
中文摘要:本研究提出了一种基于大型语言模型的无监督分词新框架,并开发了LLACA方法,通过结合Aho-Corasick自动机与LLM的深度理解能力,实现了能根据上下文动态调整的n-gram模型,在多语言分词任务上显著优于传统方法。
English Summary: This study introduces a novel unsupervised word segmentation framework leveraging Large Language Models (LLMs) and proposes LLACA, a method combining Aho-Corasick automata with LLM insights to dynamically adapt n-gram models for enhanced performance across multiple languages.

Authors:Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, Yixue Li
Title: DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue
Abstract:
Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Single-round consultation systems require patients to describe all symptoms upfront, leading to vague diagnosis with unclear complaints. Traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility and fail to intelligently extract key clinical information. To address these limitations, we propose \Ours{}, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that \Ours{} outperforms existing models in both multi-turn reasoning capability and final diagnostic performance. This approach shows immense practical value by reducing misdiagnosis risks in time-pressured settings, freeing clinicians for complex cases, and pioneering a strategy to optimize medical resource allocation and alleviate workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL
中文摘要:该研究提出的基于强化学习的多智能体协作框架,通过动态优化问诊策略提升临床诊断准确性,为优化医疗资源配置开辟了新途径。
English Summary: The proposed reinforcement learning-based multi-agent framework enhances clinical consultations by enabling dynamic questioning strategies that improve diagnostic accuracy and optimize medical resource allocation.

Authors:Silin Li, Yuhang Guo, Jiashu Yao, Zeming Liu, Haifeng Wang
Title: HomeBench: Evaluating LLMs in Smart Homes with Valid and Invalid Instructions Across Single and Multiple Devices
Abstract:
Large language models (LLMs) have the potential to revolutionize smart home assistants by enhancing their ability to accurately understand user needs and respond appropriately, which is extremely beneficial for building a smarter home environment. While recent studies have explored integrating LLMs into smart home systems, they primarily focus on handling straightforward, valid single-device operation instructions. However, real-world scenarios are far more complex and often involve users issuing invalid instructions or controlling multiple devices simultaneously. These have two main challenges: LLMs must accurately identify and rectify errors in user instructions and execute multiple user instructions perfectly. To address these challenges and advance the development of LLM-based smart home assistants, we introduce HomeBench, the first smart home dataset with valid and invalid instructions across single and multiple devices in this paper. We have experimental results on 13 distinct LLMs; e.g., GPT-4o achieves only a 0.0% success rate in the scenario of invalid multi-device instructions, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning, retrieval-augmented generation, and fine-tuning. Our code and dataset are publicly available at https://github.com/BITHLP/HomeBench.
中文摘要:本文提出了首个针对智能家居助手中无效和多设备指令挑战的数据集HomeBench,揭示了即使如GPT-4o等先进模型在现有增强技术下仍难以应对复杂现实场景。
English Summary: This paper introduces HomeBench, the first dataset addressing the challenges of invalid and multi-device instructions for LLM-based smart home assistants, revealing that even advanced models like GPT-4o struggle with complex real-world scenarios despite existing enhancement techniques.

Authors:George Kour, Itay Nakash, Ateret Anaby-Tavor, Michal Shmueli-Scheuer
Title: Think Again! The Effect of Test-Time Compute on Preferences, Opinions, and Beliefs of Large Language Models
Abstract:
As Large Language Models (LLMs) become deeply integrated into human life and increasingly influence decision-making, it's crucial to evaluate whether and to what extent they exhibit subjective preferences, opinions, and beliefs. These tendencies may stem from biases within the models, which may shape their behavior, influence the advice and recommendations they offer to users, and potentially reinforce certain viewpoints. This paper presents the Preference, Opinion, and Belief survey (POBs), a benchmark developed to assess LLMs' subjective inclinations across societal, cultural, ethical, and personal domains. We applied our benchmark to evaluate leading open- and closed-source LLMs, measuring desired properties such as reliability, neutrality, and consistency. In addition, we investigated the effect of increasing the test-time compute, through reasoning and self-reflection mechanisms, on those metrics. While effective in other tasks, our results show that these mechanisms offer only limited gains in our domain. Furthermore, we reveal that newer model versions are becoming less consistent and more biased toward specific viewpoints, highlighting a blind spot and a concerning trend. POBS: https://ibm.github.io/POBS
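中文摘要:本文提出POBs基准,用于评估大语言模型在社会、文化、伦理和个人领域的主观偏好、观点与信念,发现推理与自我反思等测试时计算手段收益有限,且较新的模型版本一致性下降、观点偏向性增强。
English Summary: This paper introduces POBs, a benchmark assessing LLMs' subjective preferences, opinions, and beliefs across societal, cultural, ethical, and personal domains, finding that test-time compute mechanisms offer only limited gains and that newer model versions are becoming less consistent and more biased.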

Authors:Xuandong Zhao, Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song
Title: Learning to Reason without External Rewards
Abstract:
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor
中文摘要:提出的Intuitor框架通过将模型自身置信度作为内在奖励信号,使大语言模型能够在无需外部监督的情况下学习复杂推理,在保持与监督方法相当性能的同时,实现了跨领域任务的更优泛化能力。
English Summary: The proposed Intuitor framework enables large language models to learn complex reasoning through self-certainty as an intrinsic reward signal, achieving comparable performance to supervised methods while demonstrating superior generalization across domains without requiring external supervision.
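One way to picture self-certainty is as the distance of each next-token distribution from uniform, averaged over the generated sequence. The sketch below uses KL divergence from uniform as that proxy; treat the exact formula as an assumption rather than the paper's definition.

```python
import torch
import torch.nn.functional as F

# Sketch of a self-certainty style score: how far each next-token
# distribution is from uniform, averaged over generated positions.
def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """logits: (seq_len, vocab_size) -> scalar confidence score."""
    logp = F.log_softmax(logits, dim=-1)
    vocab = logits.size(-1)
    # KL(uniform || p) = -mean_v log p_v - log(vocab); zero iff p is uniform
    kl_from_uniform = -logp.mean(dim=-1) - torch.log(torch.tensor(float(vocab)))
    return kl_from_uniform.mean()

score = self_certainty(torch.randn(10, 32000))  # higher = more confident
```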

Authors:Jintao Tong, Wenwei Jin, Pengda Qin, Anqi Li, Yixiong Zou, Yuhong Li, Yuhua Li, Ruixuan Li
Title: FlowCut: Rethinking Redundancy via Information Flow for Efficient Vision-Language Models
Abstract:
Large vision-language models (LVLMs) excel at multimodal understanding but suffer from high computational costs due to redundant vision tokens. Existing pruning methods typically rely on single-layer attention scores to rank and prune redundant visual tokens to solve this inefficiency. However, as the interaction between tokens and layers is complicated, this raises a basic question: Is such a simple single-layer criterion sufficient to identify redundancy? To answer this question, we rethink the emergence of redundant visual tokens from a fundamental perspective: information flow, which models the interaction between tokens and layers by capturing how information moves between tokens across layers. We find (1) the CLS token acts as an information relay, which can simplify the complicated flow analysis; (2) the redundancy emerges progressively and dynamically via layer-wise attention concentration; and (3) relying solely on attention scores from single layers can lead to contradictory redundancy identification. Based on this, we propose FlowCut, an information-flow-aware pruning framework, mitigating the insufficiency of the current criterion for identifying redundant tokens and better aligning with the model's inherent behaviors. Extensive experiments show that FlowCut achieves superior results, outperforming SoTA by 1.6% on LLaVA-1.5-7B with 88.9% token reduction, and by 4.3% on LLaVA-NeXT-7B with 94.4% reduction, delivering 3.2x speed-up in the prefilling stage. Our code is available at https://github.com/TungChintao/FlowCut
中文: 大型视觉语言模型因冗余视觉令牌导致计算效率低下,现有剪枝方法仅依赖单层注意力评分存在不足,因此提出FlowCut这一信息流感知框架,通过更贴合模型内在行为来提升性能与速度。
English: Large vision-language models face computational inefficiency from redundant visual tokens, which existing pruning methods inadequately address by relying on single-layer attention scores, prompting the development of FlowCut, an information-flow-aware framework that enhances performance and speed by better aligning with the model's inherent behaviors.
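For contrast with the flow-aware criterion, here is the single-layer baseline the paper argues is insufficient: rank visual tokens by the attention they receive from the CLS token and keep the top fraction. Shapes follow a ViT-style encoder and are assumptions.

```python
import torch

# Simplified single-layer pruning step (the criterion FlowCut improves
# on): keep the tokens that receive the most attention from CLS.
def prune_by_cls_attention(attn: torch.Tensor, keep_ratio: float = 0.1):
    """attn: (num_tokens, num_tokens) attention matrix, row 0 = CLS."""
    cls_to_tokens = attn[0, 1:]                      # CLS attention to others
    k = max(1, int(keep_ratio * cls_to_tokens.numel()))
    keep = torch.topk(cls_to_tokens, k).indices + 1  # shift past CLS slot
    return torch.cat([torch.tensor([0]), keep.sort().values])

attn = torch.softmax(torch.randn(577, 577), dim=-1)  # e.g. 576 patches + CLS
kept_idx = prune_by_cls_attention(attn, keep_ratio=0.111)
```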

Authors:Yejin Lee, Joonghyuk Hahn, Hyeseon Ahn, Yo-Sub Han
Title: AmpleHate: Amplifying the Attention for Versatile Implicit Hate Detection
Abstract:
Implicit hate speech detection is challenging due to its subtlety and reliance on contextual interpretation rather than explicit offensive words. Current approaches rely on contrastive learning, which has been shown to be effective at distinguishing hate from non-hate sentences. Humans, however, detect implicit hate speech by first identifying specific targets within the text and subsequently interpreting how these targets relate to their surrounding context. Motivated by this reasoning process, we propose AmpleHate, a novel approach designed to mirror human inference for implicit hate detection. AmpleHate identifies explicit targets using a pretrained Named Entity Recognition model and captures implicit target information via [CLS] tokens. It computes attention-based relationships among explicit targets, implicit targets, and sentence context, and then directly injects these relational vectors into the final sentence representation. This amplifies the critical signals of target-context relations for determining implicit hate. Experiments demonstrate that AmpleHate achieves state-of-the-art performance, outperforming contrastive learning baselines by an average of 82.14% and achieving faster convergence. Qualitative analyses further reveal that attention patterns produced by AmpleHate closely align with human judgement, underscoring its interpretability and robustness. Our code is publicly available at: https://github.com/leeyejin1231/AmpleHate.
Chinese Summary: AmpleHate提出了一种新颖的隐式仇恨言论检测方法,通过模拟人类推理过程识别目标与上下文关系,在性能和收敛速度上均优于现有技术。
English Summary: AmpleHate introduces a novel method for detecting implicit hate speech by mimicking human reasoning, using target identification and context relationships to achieve superior performance and faster convergence compared to existing approaches.

Authors:Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo
Title: LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study
Abstract:
The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://github.com/docworlds/tsg-bench.
中文: 大语言模型在场景图理解方面表现出色,但在从复杂叙事生成场景图时存在困难,这凸显了该领域需要改进方法。
English: Large Language Models demonstrate strong scene graph understanding but face challenges in generating scene graphs from complex narratives, highlighting the need for improved methodologies in this area.

Authors:Pingzhi Li, Zhen Tan, Huaizhi Qu, Huan Liu, Tianlong Chen
Title: DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation
Abstract:
Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while preserving or even improving the original performance of the teacher model, student models distilled from the defensively generated teacher outputs demonstrate catastrophically reduced performance, demonstrating our method's effectiveness as a practical safeguard against KD-based model imitation.
中文: 本文提出防御性输出生成(DOGe)策略,通过微调教师大语言模型的最后一层,使其输出在保持对合法用户有效的同时,严重破坏基于知识蒸馏的模型模仿效果。
English: This paper introduces Defensive Output Generation (DOGe), an efficient strategy that fine-tunes the final layer of a teacher LLM to subtly alter its outputs, preserving utility for legitimate users while severely degrading performance in distillation-based imitation attempts.
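The abstract fixes two ingredients: only the final linear layer is trained, and the loss is adversarial. One plausible (assumed) instantiation keeps task accuracy via cross-entropy while pushing the output distribution away from what a proxy student would cleanly imitate; the exact DOGe loss is not given in the abstract.

```python
import torch
import torch.nn as nn

# Assumed sketch of an adversarial defensive loss: minimize task CE while
# maximizing divergence between teacher outputs and a proxy student.
def defensive_loss(teacher_logits, labels, proxy_student_logits, lam=0.5):
    ce = nn.functional.cross_entropy(teacher_logits, labels)
    kl = nn.functional.kl_div(
        proxy_student_logits.log_softmax(-1),   # proxy student (log-probs)
        teacher_logits.softmax(-1),             # teacher outputs (probs)
        reduction="batchmean",
    )
    return ce - lam * kl  # lower CE, higher teacher-student divergence

logits = torch.randn(8, 32000, requires_grad=True)   # stand-in head outputs
labels = torch.randint(0, 32000, (8,))
loss = defensive_loss(logits, labels, torch.randn(8, 32000))
loss.backward()

# Per the abstract, only the head would be trainable, roughly:
# for p in model.parameters(): p.requires_grad_(False)
# model.lm_head.weight.requires_grad_(True)   # hypothetical attribute name
```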

Authors:Guilong Lu, Xuntao Guo, Rongjunchen Zhang, Wenqiao Zhu, Ji Liu
Title: BizFinBench: A Business-Driven Real-World Financial Benchmark for Evaluating LLMs
Abstract:
Large language models excel in general tasks, yet assessing their reliability in logic-heavy, precision-critical domains like finance, law, and healthcare remains challenging. To address this, we introduce BizFinBench, the first benchmark specifically designed to evaluate LLMs in real-world financial applications. BizFinBench consists of 6,781 well-annotated queries in Chinese, spanning five dimensions: numerical calculation, reasoning, information extraction, prediction recognition, and knowledge-based question answering, grouped into nine fine-grained categories. The benchmark includes both objective and subjective metrics. We also introduce IteraJudge, a novel LLM evaluation method that reduces bias when LLMs serve as evaluators in objective metrics. We benchmark 25 models, including both proprietary and open-source systems. Extensive experiments show that no model dominates across all tasks. Our evaluation reveals distinct capability patterns: (1) In Numerical Calculation, Claude-3.5-Sonnet (63.18) and DeepSeek-R1 (64.04) lead, while smaller models like Qwen2.5-VL-3B (15.92) lag significantly; (2) In Reasoning, proprietary models dominate (ChatGPT-o3: 83.58, Gemini-2.0-Flash: 81.15), with open-source models trailing by up to 19.49 points; (3) In Information Extraction, the performance spread is the largest, with DeepSeek-R1 scoring 71.46, while Qwen3-1.7B scores 11.23; (4) In Prediction Recognition, performance variance is minimal, with top models scoring between 39.16 and 50.00. We find that while current LLMs handle routine finance queries competently, they struggle with complex scenarios requiring cross-concept reasoning. BizFinBench offers a rigorous, business-aligned benchmark for future research. The code and dataset are available at https://github.com/HiThink-Research/BizFinBench.
中文: BizFinBench作为首个金融领域专用基准,通过6,781条标注中文查询和创新的IteraJudge评估方法,系统评估了25个模型在五大金融能力维度的表现,发现现有模型在跨概念推理等复杂场景仍存在明显不足。
English: BizFinBench is a specialized benchmark with 6,781 annotated Chinese queries to rigorously evaluate LLMs in financial applications, revealing significant performance gaps across tasks and introducing IteraJudge to reduce evaluation bias.

Authors:Abhijnan Nath, Carine Graff, Andrei Bachinin, Nikhil Krishnaswamy
Title: Frictional Agent Alignment Framework: Slow Down and Don't Break Things
Abstract:
AI support of collaborative interactions entails mediating potential misalignment between interlocutor beliefs. Common preference alignment methods like DPO excel in static settings, but struggle in dynamic collaborative tasks where the explicit signals of interlocutor beliefs are sparse and skewed. We propose the Frictional Agent Alignment Framework (FAAF) to generate precise, context-aware "friction" that prompts deliberation and re-examination of existing evidence. FAAF's two-player objective decouples from data skew: a frictive-state policy identifies belief misalignments, while an intervention policy crafts collaborator-preferred responses. We derive an analytical solution to this objective, enabling training of a single policy via a simple supervised loss. Experiments on three benchmarks show FAAF outperforms competitors in producing concise, interpretable friction and in OOD generalization. By aligning LLMs to act as adaptive "thought partners" -- not passive responders -- FAAF advances scalable, dynamic human-AI collaboration. Our code and data can be found at https://github.com/csu-signal/FAAF_ACL.
中文摘要:摩擦代理对齐框架(FAAF)通过生成情境感知的“摩擦”来促进动态协作中的信念对齐,其解耦的双策略设计在可解释性和泛化能力上均优于现有方法。
English Summary: The Frictional Agent Alignment Framework (FAAF) addresses belief misalignment in dynamic AI collaboration by generating contextual friction that prompts evidence re-examination, outperforming existing methods in interpretability and generalization.

Authors:Kidist Amde Mekonnen, Yosef Worku Alemneh, Maarten de Rijke
Title: Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval
Abstract:
Neural retrieval methods using transformer-based pre-trained language models have advanced multilingual and cross-lingual retrieval. However, their effectiveness for low-resource, morphologically rich languages such as Amharic remains underexplored due to data scarcity and suboptimal tokenization. We address this gap by introducing Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10 and a 9.86% gain in Recall@10 over the strongest multilingual baseline, Arctic Embed 2.0 (568M parameters). More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller. Additionally, we train a ColBERT-based late interaction retrieval model that achieves the highest MRR@10 score (0.843) among all evaluated models. We benchmark our proposed models against both sparse and dense retrieval baselines to systematically assess retrieval effectiveness in Amharic. Our analysis highlights key challenges in low-resource settings and underscores the importance of language-specific adaptation. To foster future research in low-resource IR, we publicly release our dataset, codebase, and trained models at https://github.com/kidist-amde/amharic-ir-benchmarks.
Chinese: 本研究针对阿姆哈拉语开发了专门的密集检索模型,相比现有多语言基线实现了高达17.6%的性能提升,同时模型体积大幅缩小,并公开了全部数据集与代码以推动低资源信息检索研究。
English: This research introduces Amharic-specific dense retrieval models that significantly outperform existing multilingual baselines, achieving up to 17.6% improvement in retrieval metrics while being substantially more compact, with all resources made publicly available to advance low-resource information retrieval.

Authors:Junnan Liu, Linhao Luo, Thuy-Trang Vu, Gholamreza Haffari
Title: SituatedThinker: Grounding LLM Reasoning with Real-World through Situated Thinking
Abstract:
Recent advances in large language models (LLMs) demonstrate their impressive reasoning capabilities. However, the reasoning confined to internal parametric space limits LLMs' access to real-time information and understanding of the physical world. To overcome this constraint, we introduce SituatedThinker, a novel framework that enables LLMs to ground their reasoning in real-world contexts through situated thinking, which adaptively combines both internal knowledge and external information with predefined interfaces. By utilizing reinforcement learning, SituatedThinker incentivizes deliberate reasoning with the real world to acquire information and feedback, allowing LLMs to surpass their knowledge boundaries and enhance reasoning. Experimental results demonstrate significant performance improvements on multi-hop question-answering and mathematical reasoning benchmarks. Furthermore, SituatedThinker demonstrates strong performance on unseen tasks, such as KBQA, TableQA, and text-based games, showcasing the generalizable real-world grounded reasoning capability. Our codes are available at https://github.com/jnanliu/SituatedThinker.
中文摘要:SituatedThinker框架通过强化学习将现实世界情境融入大型语言模型的推理过程,显著提升了多项基准测试和未知任务的表现。
English Summary: The SituatedThinker framework enhances large language models' reasoning by integrating real-world contexts through reinforcement learning, significantly improving performance on various benchmarks and unseen tasks.

Authors:Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger
Title: LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Abstract:
Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations such as failures in reasoning, hallucinations, and limited multilingual capability. While prior reviews have addressed these issues, they often focus on individual limitations or consider them within the broader context of evaluating overall model performance. This survey addresses the gap by presenting a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to 2025, using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we extract 14,648 relevant limitation papers using keyword filtering and LLM-based classification, validated against expert labels. Using topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM), we identify between 7 and 15 prominent types of limitations discussed in recent LLM research across the ACL and arXiv datasets. We find that LLM-related research increases nearly sixfold in ACL and nearly fifteenfold in arXiv between 2022 and 2025, while LLLMs research grows even faster, by a factor of over 12 in ACL and nearly 28 in arXiv. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward safety and controllability (with topics like security risks, alignment, hallucinations, knowledge editing), and multimodality between 2022 and 2025. We offer a quantitative view of trends in LLM limitations research and release a dataset of annotated abstracts and a validated methodology, available at: https://github.com/a-kostikova/LLLMs-Survey.
中文摘要:本综述通过数据驱动方法分析了2022至2025年间大语言模型的局限性,发现推理能力是最受关注的研究短板,同时揭示了相关研究的加速增长态势及不同学术数据集中的研究主题演变。
English Summary: This survey provides a data-driven analysis of large language model limitations from 2022-2025, identifying reasoning as the most studied constraint while documenting accelerated research growth and evolving focus areas across academic datasets.

Authors:Steffen Backmann, David Guzman Piedrahita, Emanuel Tewolde, Rada Mihalcea, Bernhard Schölkopf, Zhijing Jin
Title: When Ethics and Payoffs Diverge: LLM Agents in Morally Charged Social Dilemmas
Abstract:
Recent advances in large language models (LLMs) have enabled their use in complex agentic roles, involving decision-making with humans or other agents, making ethical alignment a key AI safety concern. While prior work has examined both LLMs' moral judgment and strategic behavior in social dilemmas, there is limited understanding of how they act when moral imperatives directly conflict with rewards or incentives. To investigate this, we introduce Moral Behavior in Social Dilemma Simulation (MoralSim) and evaluate how LLMs behave in the prisoner's dilemma and public goods game with morally charged contexts. In MoralSim, we test a range of frontier models across both game structures and three distinct moral framings, enabling a systematic examination of how LLMs navigate social dilemmas in which ethical norms conflict with payoff-maximizing strategies. Our results show substantial variation across models in both their general tendency to act morally and the consistency of their behavior across game types, the specific moral framing, and situational factors such as opponent behavior and survival risks. Crucially, no model exhibits consistently moral behavior in MoralSim, highlighting the need for caution when deploying LLMs in agentic roles where the agent's "self-interest" may conflict with ethical expectations. Our code is available at https://github.com/sbackmann/moralsim.
中文: 该研究通过MoralSim评估大语言模型在道德规范与激励冲突的社会困境中的行为,发现模型存在显著行为差异且无一能保持一贯道德表现,突显了其在代理角色部署中的风险。
English: The study introduces MoralSim to assess how large language models navigate social dilemmas where ethical norms conflict with incentives, revealing significant behavioral inconsistencies and no model's consistently moral performance, underscoring risks in agentic deployments.
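A toy version of one MoralSim cell: a morally framed prisoner's dilemma round in which the payoff-maximizing move conflicts with the moral frame. Prompt wording and payoff values are illustrative, not the benchmark's configuration.

```python
# Toy morally charged prisoner's dilemma round: payoffs reward defection,
# while the framing asks for cooperation. Values are illustrative.
PAYOFFS = {  # (my_move, their_move) -> my_payoff
    ("cooperate", "cooperate"): 3, ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5, ("defect", "defect"): 1,
}

def round_prompt(moral_frame: str) -> str:
    return (
        f"{moral_frame} You and a partner each choose to 'cooperate' or "
        "'defect'. Mutual cooperation pays 3 each; defecting against a "
        "cooperator pays you 5 and them 0; mutual defection pays 1 each. "
        "Reply with one word."
    )

prompt = round_prompt("You promised your partner you would act honestly.")
my_payoff = PAYOFFS[("defect", "cooperate")]  # payoff-maximizing but immoral
```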

Authors:Yunhai Hu, Tianhua Xia, Zining Liu, Rahul Raman, Xingyu Liu, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang
Title: DREAM: Drafting with Refined Target Features and Entropy-Adaptive Cross-Attention Fusion for Multimodal Speculative Decoding
Abstract:
Speculative decoding (SD) has emerged as a powerful method for accelerating autoregressive generation in large language models (LLMs), yet its integration into vision-language models (VLMs) remains underexplored. We introduce DREAM, a novel speculative decoding framework tailored for VLMs that combines three key innovations: (1) a cross-attention-based mechanism to inject intermediate features from the target model into the draft model for improved alignment, (2) adaptive intermediate feature selection based on attention entropy to guide efficient draft model training, and (3) visual token compression to reduce draft model latency. DREAM enables efficient, accurate, and parallel multimodal decoding with significant throughput improvement. Experiments across a diverse set of recent popular VLMs, including LLaVA, Pixtral, SmolVLM and Gemma3, demonstrate up to 3.6x speedup over conventional decoding and significantly outperform prior SD baselines in both inference throughput and speculative draft acceptance length across a broad range of multimodal benchmarks. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM.git
中文: DREAM为视觉语言模型提出了一种创新的推测解码框架,通过交叉注意力对齐、自适应特征选择和视觉标记压缩技术,将推理吞吐量最高提升3.6倍。
English: DREAM introduces a novel speculative decoding framework for vision-language models that enhances inference throughput up to 3.6x through cross-attention alignment, adaptive feature selection, and visual token compression.

Authors:Pradyumna Shyama Prasad, Minh Nhat Nguyen
Title: When Two LLMs Debate, Both Think They'll Win
Abstract:
Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models' private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLMs are now increasingly deployed without careful review in assistant and agentic roles. Code for our experiments is available at https://github.com/pradyuprasad/llms_overconfidence
中文摘要:本研究表明大型语言模型在对抗性辩论中表现出系统性过度自信且无法恰当调整置信度,对其在动态多轮任务中的可靠性提出了重要质疑。
English Summary: This study reveals that large language models exhibit systematic overconfidence and fail to properly adjust their confidence in adversarial debates, raising concerns about their reliability in dynamic, multi-turn tasks.

Authors:Zhuo Liu, Moxin Li, Xun Deng, Qifan Wang, Fuli Feng
Title: Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge
Abstract:
LLM-as-a-Judge employs large language models (LLMs), such as GPT-4, to evaluate the quality of LLM-generated responses, gaining popularity for its cost-effectiveness and strong alignment with human evaluations. However, training proxy judge models using evaluation data generated by powerful teacher models introduces a critical yet previously overlooked issue: teacher preference bias, where the proxy judge model learns a biased preference for responses from the teacher model. To tackle this problem, we propose a novel setting that incorporates an additional assistant model, which is not biased toward the teacher model's responses, to complement the training data. Building on this setup, we introduce AGDe-Judge, a three-stage framework designed to debias from both the labels and feedbacks in the training data. Extensive experiments demonstrate that AGDe-Judge effectively reduces teacher preference bias while maintaining strong performance across six evaluation benchmarks. Code is available at https://github.com/Liuz233/AGDe-Judge.
中文: LLM-as-a-Judge 利用大语言模型评估生成内容时存在教师偏好偏差,AGDe-Judge 通过引入无偏助理模型和三阶段去偏框架,在六个基准测试中有效降低偏差并保持性能。
English: LLM-as-a-Judge uses models like GPT-4 to assess LLM outputs efficiently but faces teacher preference bias, which AGDe-Judge addresses by incorporating an unbiased assistant model and a three-stage debiasing framework to maintain performance across benchmarks.

Authors:Xuyang Liu, Zichen Wen, Shaobo Wang, Junjie Chen, Zhishan Tao, Yubo Wang, Xiangqi Jin, Chang Zou, Yiyu Wang, Chenfei Liao, Xu Zheng, Honggang Chen, Weijia Li, Xuming Hu, Conghui He, Linfeng Zhang
Title: Shifting AI Efficiency From Model-Centric to Data-Centric Compression
Abstract:
The rapid advancement of large language models (LLMs) and multi-modal LLMs (MLLMs) has historically relied on model-centric scaling through increasing parameter counts from millions to hundreds of billions to drive performance gains. However, as we approach hardware limits on model size, the dominant computational bottleneck has fundamentally shifted to the quadratic cost of self-attention over long token sequences, now driven by ultra-long text contexts, high-resolution images, and extended videos. In this position paper, \textbf{we argue that the focus of research for efficient AI is shifting from model-centric compression to data-centric compression}. We position token compression as the new frontier, which improves AI efficiency via reducing the number of tokens during model training or inference. Through comprehensive analysis, we first examine recent developments in long-context AI across various domains and establish a unified mathematical framework for existing model efficiency strategies, demonstrating why token compression represents a crucial paradigm shift in addressing long-context overhead. Subsequently, we systematically review the research landscape of token compression, analyzing its fundamental benefits and identifying its compelling advantages across diverse scenarios. Furthermore, we provide an in-depth analysis of current challenges in token compression research and outline promising future directions. Ultimately, our work aims to offer a fresh perspective on AI efficiency, synthesize existing research, and catalyze innovative developments to address the challenges that increasing context lengths pose to the AI community's advancement.
中文: AI效率研究的重点正从模型为中心的扩展转向数据为中心的令牌压缩,以解决长序列在大语言模型中引起的计算瓶颈问题。
English: The focus of AI efficiency research is shifting from model-centric scaling to data-centric token compression to overcome computational bottlenecks caused by long sequences in large language models.

Authors:Wenyang Luo, Wayne Xin Zhao, Jing Sha, Shijin Wang, Ji-Rong Wen
Title: MMATH: A Multilingual Benchmark for Mathematical Reasoning
Abstract:
The advent of large reasoning models, such as OpenAI o1 and DeepSeek R1, has significantly advanced complex reasoning tasks. However, their capabilities in multilingual complex reasoning remain underexplored, with existing efforts largely focused on simpler tasks like MGSM. To address this gap, we introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. Using MMATH, we observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. To address this, we explore strategies including prompting and training, demonstrating that reasoning in English and answering in target languages can simultaneously enhance performance and preserve target-language consistency. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models. Our code and data can be found at https://github.com/RUCAIBox/MMATH.
中文总结:MMATH基准测试揭示了先进模型在多语言复杂推理中存在显著性能差距和语言偏离问题,而采用英语推理结合目标语言回答的策略可有效提升表现并保持语言一致性。
English Summary: The MMATH benchmark reveals significant performance gaps and off-target language issues in multilingual complex reasoning by advanced models, with strategies like reasoning in English and answering in target languages proving effective for improvement.
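The reported remedy is a prompting strategy: reason in English, answer in the target language. A minimal sketch (instruction wording is ours, not MMATH's exact prompt):

```python
# Sketch of the "reason in English, answer in the target language"
# strategy the paper reports as effective. Wording is illustrative.
def crosslingual_prompt(problem: str, target_lang: str) -> str:
    return (
        "Reason about the following problem step by step in English. "
        f"Then give the final answer in {target_lang} only.\n\n{problem}"
    )

print(crosslingual_prompt("3x + 5 = 20. x = ?", "Swahili"))
```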

Authors:Zheng Chu, Huiming Fan, Jingchang Chen, Qianyu Wang, Mingda Yang, Jiafeng Liang, Zhongjie Wang, Hao Li, Guo Tang, Ming Liu, Bing Qin
Title: Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering
Abstract:
Although large language models (LLMs) have demonstrated remarkable reasoning capabilities, they still face challenges in knowledge-intensive multi-hop reasoning. Recent work explores iterative retrieval to address complex problems. However, the lack of intermediate guidance often results in inaccurate retrieval and flawed intermediate reasoning, leading to incorrect reasoning. To address these, we propose Self-Critique Guided Iterative Reasoning (SiGIR), which uses self-critique feedback to guide the iterative reasoning process. Specifically, through end-to-end training, we enable the model to iteratively address complex problems via question decomposition. Additionally, the model is able to self-evaluate its intermediate reasoning steps. During iterative reasoning, the model engages in branching exploration and employs self-evaluation to guide the selection of promising reasoning trajectories. Extensive experiments on three multi-hop reasoning datasets demonstrate the effectiveness of our proposed method, surpassing the previous SOTA by $8.6\%$. Furthermore, our thorough analysis offers insights for future research. Our code, data, and models are available at Github: https://github.com/zchuz/SiGIR-MHQA.
Chinese: 提出的SiGIR方法通过自我批判反馈指导迭代问题分解和路径选择,在多跳推理任务中实现了比先前最优方法8.6%的性能提升。
English: The proposed SiGIR method enhances multi-hop reasoning by using self-critique feedback to guide iterative question decomposition and trajectory selection, achieving an 8.6% improvement over previous state-of-the-art approaches.

Authors:Benjamin Clavié, Florian Brand
Title: ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models
Abstract:
Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks), there is limited assessment of VLMs' ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly their reasoning over visually presented extensive textual content, a capability critical for practical applications. ReadBench is available at https://github.com/answerdotai/ReadBench.
中文摘要:ReadBench是一个专门评估大型视觉语言模型在文本密集图像上阅读理解能力的新基准,发现模型在处理长篇内容时表现显著下降,而文本分辨率影响甚微。
English Summary: ReadBench is a new benchmark designed to assess how well Large Vision-Language Models read and reason about text-rich images, revealing significant performance drops with longer textual content despite minimal impact from text resolution.
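The benchmark's core transformation is mechanical: render a text-only context into an image so the model must read it visually. A rough sketch with PIL, where font, wrapping, and sizing are arbitrary assumptions:

```python
from PIL import Image, ImageDraw

# Sketch of ReadBench-style rendering: turn a benchmark context into an
# image of text. Wrapping and sizing choices here are arbitrary.
def render_text_to_image(text: str, width: int = 800, pad: int = 20) -> Image.Image:
    lines, line = [], ""
    for word in text.split():
        if len(line) + len(word) + 1 > 80:   # crude 80-character wrap
            lines.append(line)
            line = word
        else:
            line = f"{line} {word}".strip()
    lines.append(line)
    img = Image.new("RGB", (width, pad * 2 + 15 * len(lines)), "white")
    draw = ImageDraw.Draw(img)
    for i, l in enumerate(lines):
        draw.text((pad, pad + 15 * i), l, fill="black")  # default font
    return img

img = render_text_to_image("The quick brown fox jumps over the lazy dog. " * 20)
img.save("readbench_style_page.png")
```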

Authors:Jaemin Kim, Hangeol Chang, Hyunmin Hwang, Choonghan Kim, Jong Chul Ye
Title: Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs
Abstract:
Large Language Models (LLMs) have demonstrated remarkable general capabilities, but enhancing skills such as reasoning often demands substantial computational resources and may compromise their generalization. While Parameter-Efficient Fine-Tuning (PEFT) methods offer a more resource-conscious alternative, they typically require retraining for each LLM backbone due to architectural dependencies. To address these challenges, we propose Universal Reasoner (UniR) - a single, lightweight, composable, and plug-and-play reasoning module that can be used with any frozen LLM to endow it with specialized reasoning capabilities. Specifically, UniR decomposes the reward into a standalone reasoning module that is trained independently using predefined rewards, effectively translating trajectory-level signals into token-level guidance. Once trained, UniR can be combined with any frozen LLM at inference time by simply adding its output logits to those of the LLM backbone. This additive structure naturally enables modular composition: multiple UniR modules trained for different tasks can be jointly applied by summing their logits, enabling complex reasoning via composition. Experimental results on mathematical reasoning and machine translation tasks show that UniR significantly outperforms existing baseline fine-tuning methods using the Llama3.2 model. Furthermore, UniR demonstrates strong weak-to-strong generalization: reasoning modules trained on smaller models effectively guide much larger LLMs. This makes UniR a cost-efficient, adaptable, and robust solution for enhancing reasoning in LLMs without compromising their core capabilities. Code is open-sourced at https://github.com/hangeol/UniR
中文: Universal Reasoner (UniR) 是一种轻量级即插即用推理模块,可与任何冻结的大型语言模型结合,无需重新训练即可增强其专业推理能力,在多项任务中超越现有方法,实现了高效、自适应的推理增强。
English: The Universal Reasoner (UniR) is a lightweight, plug-and-play reasoning module that can be added to any frozen large language model to enhance its specialized reasoning capabilities without retraining, outperforming existing methods and enabling cost-efficient, adaptable reasoning across tasks.
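
A minimal sketch of the additive logit composition UniR describes, assuming the frozen backbone and each reasoning module share one tokenizer and vocabulary; the tensors, module names, and mixing weights below are toy stand-ins, not the released implementation:

```python
import torch

def composed_next_token_logits(backbone_logits: torch.Tensor,
                               module_logits: list[torch.Tensor],
                               weights: list[float]) -> torch.Tensor:
    """Sum the frozen backbone's logits with weighted reasoning-module logits."""
    out = backbone_logits.clone()
    for w, logits in zip(weights, module_logits):
        out += w * logits  # additive structure allows modular composition
    return out

# Toy usage: a batch of 2 positions over a 10-token vocabulary.
backbone = torch.randn(2, 10)
math_module = torch.randn(2, 10)   # e.g., a module trained for math reasoning
mt_module = torch.randn(2, 10)     # e.g., a module trained for translation
logits = composed_next_token_logits(backbone, [math_module, mt_module], [1.0, 0.5])
next_tokens = logits.argmax(dim=-1)
print(next_tokens)
```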

Authors:Ke-Han Lu, Chun-Yi Kuan, Hung-yi Lee
Title: Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models
Abstract:
We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities and quantify catastrophic forgetting in speech-aware language models (SLMs). Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Existing benchmarks conflate speech perception with instruction-following, hindering evaluation of these distinct skills. To address this gap, we provide a benchmark for diagnosing the instruction-following abilities of SLMs. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs. Additionally, these models are highly sensitive to prompt variations, often yielding inconsistent and unreliable outputs. We highlight core challenges and provide insights to guide future research, emphasizing the need for evaluation beyond task-level metrics.
中文: 我们推出Speech-IFeval评估框架,旨在检测语音增强语言模型的指令遵循能力并量化其灾难性遗忘问题,发现这些模型在执行基本指令时表现远逊于纯文本模型且对提示变化极为敏感。
English: We present Speech-IFeval, a framework that evaluates instruction-following abilities and measures catastrophic forgetting in speech-aware language models, revealing their significant struggles with basic instructions and sensitivity to prompts compared to text-based models.

Authors:Yunxin Li, Xinyu Chen, Zitao Li, Zhenyu Liu, Longyue Wang, Wenhan Luo, Baotian Hu, Min Zhang
Title: VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization
Abstract:
Applying Reinforcement Learning (RL) to Video Large Language Models (Video-LLMs) shows significant promise for complex video reasoning. However, popular Reinforcement Fine-Tuning (RFT) methods, such as outcome-based Group Relative Policy Optimization (GRPO), are limited by data preparation bottlenecks (e.g., noise or high cost) and exhibit unstable improvements in the quality of long chain-of-thoughts (CoTs) and downstream performance. To address these limitations, we propose VerIPO, a Verifier-guided Iterative Policy Optimization method designed to gradually improve Video-LLMs' capacity for generating deep, long-term reasoning chains. The core component is the Rollout-Aware Verifier, positioned between the GRPO and Direct Preference Optimization (DPO) training phases to form the GRPO-Verifier-DPO training loop. This verifier leverages small LLMs as a judge to assess the reasoning logic of rollouts, enabling the construction of high-quality contrastive data, including reflective and contextually consistent CoTs. These curated preference samples drive the efficient DPO stage (7x faster than GRPO), leading to marked improvements in reasoning chain quality, especially in terms of length and contextual consistency. This training loop benefits from GRPO's expansive search and DPO's targeted optimization. Experimental results demonstrate: 1) Significantly faster and more effective optimization compared to standard GRPO variants, yielding superior performance; 2) Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs, producing long and contextually consistent CoTs on diverse video reasoning tasks; and 3) Our model with one iteration outperforms powerful LMMs (e.g., Kimi-VL) and long reasoning models (e.g., Video-R1), highlighting its effectiveness and stability.
Chinese Summary: 强化学习应用于视频大语言模型虽前景广阔,但存在数据瓶颈和不稳定性问题,提出的VerIPO方法通过验证器引导的迭代优化,有效提升推理链质量并实现高效训练。
English Summary: Reinforcement Learning applied to Video Large Language Models shows promise but faces data bottlenecks and instability, which the proposed VerIPO method overcomes by using a verifier-guided iterative optimization to enhance reasoning chain quality efficiently.
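
A minimal sketch of how a Rollout-Aware Verifier could turn GRPO rollouts into DPO preference pairs; `judge_score` is a hypothetical stand-in for the small-LLM judge, and the margin filter is an illustrative assumption rather than the paper's exact rule:

```python
from typing import Callable

def build_dpo_pairs(rollouts_by_prompt: dict[str, list[str]],
                    judge_score: Callable[[str, str], float],
                    margin: float = 0.3) -> list[tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for the DPO stage."""
    pairs = []
    for prompt, rollouts in rollouts_by_prompt.items():
        scored = sorted(rollouts, key=lambda r: judge_score(prompt, r))
        worst, best = scored[0], scored[-1]
        # Keep only contrastive pairs whose judged quality gap is large enough.
        if judge_score(prompt, best) - judge_score(prompt, worst) >= margin:
            pairs.append((prompt, best, worst))
    return pairs

# Toy usage with a length-based judge stub.
pairs = build_dpo_pairs({"q1": ["short", "a longer, more reflective chain"]},
                        judge_score=lambda p, r: len(r) / 40)
print(pairs)
```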

Authors:Tianyu Zhang, Xinyu Wang, Lu Li, Zhenghan Tai, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
Title: STRICT: Stress Test of Rendering Images Containing Text
Abstract:
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce $\textbf{STRICT}$, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text; and (3) the rate at which models fail to follow instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.
中文:扩散模型在生成逼真图像方面表现出色,但在图像中生成一致且清晰文本方面存在不足,因此我们开发了STRICT基准来系统评估文本长度、正确性和指令遵循能力,揭示了持续存在的局限并推动了未来研究方向。
English: Diffusion models excel in realistic image generation but fail to produce consistent and legible text, leading to the creation of the STRICT benchmark for systematic evaluation across text length, correctness, and instruction adherence, revealing persistent limitations and motivating future research.
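
A minimal sketch of the kind of correctness scoring such an evaluation needs, assuming the text rendered in a generated image has already been read back (e.g., by an OCR step); the normalized edit-distance metric is an illustrative choice, not necessarily the benchmark's own:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def text_correctness(target: str, read_back: str) -> float:
    """1.0 means the rendered text matches the instructed text exactly."""
    if not target:
        return 1.0
    return 1.0 - levenshtein(target, read_back) / max(len(target), len(read_back))

print(text_correctness("hello world", "hello wrold"))  # ~0.82
```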

Authors:Saman Sarker Joy
Title: BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
Abstract:
The Massive Multitask Language Understanding (MMLU) benchmark has been widely used to evaluate language models across various domains. However, existing MMLU datasets primarily focus on high-resource languages such as English, which leaves low-resource languages like Bengali underrepresented. In this paper, we introduce BnMMLU, a benchmark to evaluate the multitask language understanding capabilities of Bengali in language models. The dataset spans 23 domains, including science, humanities, mathematics, and general knowledge, and is structured in a multiple-choice format to assess factual knowledge, application-based problem-solving, and reasoning abilities of language models. It consists of 138,949 question-option pairs. We benchmark several proprietary and open-source large language models (LLMs) on the BnMMLU test set. Additionally, we annotate the test set with three cognitive categories (factual knowledge, procedural application, and reasoning) to gain deeper insights into model strengths and weaknesses across various cognitive tasks. The results reveal significant performance gaps, highlighting the need for improved pre-training and fine-tuning strategies tailored to Bengali data. We release the dataset and benchmark results to facilitate further research in this area.
中文: 本文提出了BnMMLU,这是一个涵盖23个领域、包含138,949个问题选项对的孟加拉语综合基准,揭示了语言模型存在的显著性能差距,并强调需要针对孟加拉语数据改进相关策略。
English: The paper introduces BnMMLU, a comprehensive Bengali benchmark spanning 23 domains with 138,949 question-option pairs, revealing significant performance gaps in language models and highlighting the need for improved strategies tailored to Bengali data.

Authors:Xuanming Zhang, Yuxuan Chen, Min-Hsuan Yeh, Yixuan Li
Title: MetaMind: Modeling Human Social Thoughts with Metacognitive Multi-Agent Systems
Abstract:
Human social interactions depend on the ability to infer others' unspoken intentions, emotions, and beliefs, a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses about user mental states (e.g., intent, emotion), (2) a Domain Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with a 35.7% improvement in real-world social scenarios and a 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components and showcase the framework's ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at https://github.com/XMZhangAI/MetaMind.
中文摘要:MetaMind是一个多智能体框架,通过模拟人类思维的三阶段协作推理来增强人工智能的社交智能,在心理理论任务中实现显著提升并首次达到人类水平表现。
English Summary: MetaMind is a multi-agent framework that enhances AI social intelligence by simulating human-like reasoning through three collaborative stages, achieving significant improvements in Theory of Mind tasks and matching human performance for the first time.

Authors:Alexander Shabalin, Viacheslav Meshchaninov, Dmitry Vetrov
Title: Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation
Abstract:
Diffusion models have achieved state-of-the-art performance in generating images, audio, and video, but their adaptation to text remains challenging due to its discrete nature. Prior approaches either apply Gaussian diffusion in continuous latent spaces, which inherits semantic structure but struggles with token decoding, or operate in categorical simplex space, which respects discreteness but disregards semantic relations between tokens. In this paper, we propose Smoothing Diffusion on Token Embeddings (Smoothie), a novel diffusion method that combines the strengths of both approaches by progressively smoothing token embeddings based on semantic similarity. This technique enables gradual information removal while maintaining a natural decoding process. Experimental results on several sequence-to-sequence generation tasks demonstrate that Smoothie outperforms existing diffusion-based models in generation quality. Furthermore, ablation studies show that our proposed diffusion space yields better performance than both the standard embedding space and the categorical simplex. Our code is available at https://github.com/ashaba1in/smoothie.
中文摘要:Smoothie是一种创新的扩散模型,通过基于语义相似度逐步平滑词嵌入,在序列到序列生成任务中超越了现有扩散模型,展现出更优的生成质量。
English Summary: Smoothie is a novel diffusion model that enhances text generation by progressively smoothing token embeddings based on semantic similarity, achieving superior performance in sequence-to-sequence tasks compared to existing diffusion methods.
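
A minimal sketch of similarity-weighted embedding smoothing in the spirit of Smoothie: each token embedding is mixed toward semantically similar rows of the embedding matrix. The softmax-over-similarities operator and the temperature `tau` are illustrative assumptions, not the paper's exact diffusion schedule:

```python
import torch
import torch.nn.functional as F

def smooth_embeddings(x: torch.Tensor, emb: torch.Tensor, tau: float) -> torch.Tensor:
    """x: (seq, d) token embeddings; emb: (vocab, d) embedding matrix."""
    sim = F.normalize(x, dim=-1) @ F.normalize(emb, dim=-1).T  # (seq, vocab)
    weights = F.softmax(sim / tau, dim=-1)   # smaller tau = less smoothing
    return weights @ emb                     # similarity-weighted mixture

emb = torch.randn(100, 16)                   # toy vocabulary of 100 tokens
x = emb[torch.randint(0, 100, (8,))]         # a toy 8-token sequence
print(smooth_embeddings(x, emb, tau=0.5).shape)  # torch.Size([8, 16])
```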

Authors:Hao Chen, Haoze Li, Zhiqing Xiao, Lirong Gao, Qi Zhang, Xiaomeng Hu, Ningtao Wang, Xing Fu, Junbo Zhao
Title: ALPS: Attention Localization and Pruning Strategy for Efficient Alignment of Large Language Models
Abstract:
Aligning general-purpose large language models (LLMs) to downstream tasks often incurs significant training adjustment costs. Prior research has explored various avenues to enhance alignment efficiency, primarily through minimal-data training or data-driven activations to identify key attention heads. However, these approaches inherently introduce data dependency, which hinders generalization and reusability. To address this issue and enhance model alignment efficiency, we propose the Attention Localization and Pruning Strategy (ALPS), an efficient algorithm that localizes the most task-sensitive attention heads and prunes training by restricting attention updates to these heads, thereby reducing alignment costs. Experimental results demonstrate that our method activates only 10% of attention parameters during fine-tuning while achieving a 2% performance improvement over baselines on three tasks. Moreover, the identified task-specific heads are transferable across datasets and mitigate knowledge forgetting. Our work and findings provide a novel perspective on efficient LLM alignment. The code is available at https://github.com/VoiceBeer/ALPS.
中文摘要:本研究提出的注意力定位与剪枝策略(ALPS)通过精确定位大语言模型中任务敏感度最高的注意力头并仅对这些头部进行微调,在激活10%参数的情况下实现性能提升2%,同时发现这些任务特定头部具有跨数据集可迁移性。
English Summary: The proposed Attention Localization and Pruning Strategy (ALPS) efficiently identifies and fine-tunes only the most task-sensitive attention heads in large language models, reducing alignment costs by activating just 10% of parameters while improving performance by 2% across three tasks.
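
A minimal sketch of restricting training updates to localized heads, the second half of what ALPS describes; the gradient-hook mechanism and tensor shapes are assumptions for illustration, with head selection taken as given:

```python
import torch

def freeze_unselected_heads(weight: torch.nn.Parameter,
                            n_heads: int, head_dim: int,
                            selected: set[int]) -> None:
    """Zero the gradients of every attention head not in `selected`."""
    mask = torch.zeros(n_heads, 1, 1)
    for h in selected:
        mask[h] = 1.0
    def hook(grad):
        g = grad.view(n_heads, head_dim, -1) * mask
        return g.view_as(grad)
    weight.register_hook(hook)

# Toy usage: 8 heads of dim 16, hidden size 128; train only heads 2 and 5.
w = torch.nn.Parameter(torch.randn(8 * 16, 128))
freeze_unselected_heads(w, n_heads=8, head_dim=16, selected={2, 5})
loss = (w ** 2).sum()
loss.backward()
print(w.grad.view(8, 16, 128).abs().sum(dim=(1, 2)))  # zero except heads 2 and 5
```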

Authors:Tao Liu, Xutao Mao, Hongying Zan, Dixuan Zhang, Yifan Li, Haixin Liu, Lulu Kong, Jiaming Hou, Rui Li, YunLong Li, aoze zheng, Zhiqiang Zhang, Luo Zhewei, Kunli Zhang, Min Peng
Title: LogicCat: A Chain-of-Thought Text-to-SQL Benchmark for Complex Reasoning
Abstract:
Text-to-SQL is a critical task in natural language processing that aims to transform natural language questions into accurate and executable SQL queries. In real-world scenarios, these reasoning tasks are often accompanied by complex mathematical computations, domain knowledge, and hypothetical reasoning scenarios. However, existing large-scale Text-to-SQL datasets typically focus on business logic and task logic, neglecting critical factors such as vertical domain knowledge, complex mathematical reasoning, and hypothetical reasoning, which are essential for realistically reflecting the reasoning demands in practical applications and completing data querying and analysis. To bridge this gap, we introduce LogicCat, the first Text-to-SQL benchmark dataset specifically designed for complex reasoning and chain-of-thought parsing, encompassing physics, arithmetic, commonsense, and hypothetical reasoning scenarios. LogicCat comprises 4,038 English questions paired with 12,114 detailed chain-of-thought reasoning steps, spanning 45 databases across diverse domains, significantly surpassing existing datasets in complexity. Experimental results demonstrate that LogicCat substantially increases task difficulty, limiting current state-of-the-art models to at most 33.20% execution accuracy, indicating that this task remains exceptionally challenging. The advancement of LogicCat represents a crucial step toward developing systems suitable for real-world enterprise data analysis and autonomous query generation. We have released our dataset and code at https://github.com/Ffunkytao/LogicCat.
中文: LogicCat是首个专为复杂推理设计的Text-to-SQL基准数据集,涵盖物理、数学、常识和假设推理场景,将当前最先进模型的执行准确率压制至33.20%,显著提升了实际应用中的推理挑战。
English: LogicCat is the first Text-to-SQL benchmark dataset specifically designed to address complex reasoning scenarios—including physics, arithmetic, commonsense, and hypothetical reasoning—significantly challenging current state-of-the-art models with execution accuracy as low as 33.20%.

Authors:Meng Li, Guangda Huzhang, Haibo Zhang, Xiting Wang, Anxiang Zeng
Title: Optimal Transport-Based Token Weighting scheme for Enhanced Preference Optimization
Abstract:
Direct Preference Optimization (DPO) has emerged as a promising framework for aligning Large Language Models (LLMs) with human preferences by directly optimizing the log-likelihood difference between chosen and rejected responses. However, existing methods assign equal importance to all tokens in the response, while humans focus on more meaningful parts. This leads to suboptimal preference optimization, as irrelevant or noisy tokens disproportionately influence DPO loss. To address this limitation, we propose \textbf{O}ptimal \textbf{T}ransport-based token weighting scheme for enhancing direct \textbf{P}reference \textbf{O}ptimization (OTPO). By emphasizing semantically meaningful token pairs and de-emphasizing less relevant ones, our method introduces a context-aware token weighting scheme that yields a more contrastive reward difference estimate. This adaptive weighting enhances reward stability, improves interpretability, and ensures that preference optimization focuses on meaningful differences between responses. Extensive experiments have validated OTPO's effectiveness in improving instruction-following ability across various settings\footnote{Code is available at https://github.com/Mimasss2/OTPO.}.
中文摘要:直接偏好优化(DPO)因对所有令牌同等处理而存在局限,因此提出的OTPO方法通过基于最优传输的令牌加权机制,重点优化语义显著的令牌对,有效提升了偏好优化的性能。
English Summary: Direct Preference Optimization (DPO) faces limitations by treating all tokens equally, so the proposed OTPO method introduces optimal transport-based token weighting to emphasize meaningful tokens and improve preference optimization effectiveness.
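
A minimal sketch of a token-weighted DPO objective of the kind OTPO builds; the optimal-transport step that produces the weights is taken as given here, so `w_chosen` and `w_rejected` are assumed inputs rather than computed quantities:

```python
import torch
import torch.nn.functional as F

def weighted_dpo_loss(logp_chosen, logp_rejected,
                      ref_chosen, ref_rejected,
                      w_chosen, w_rejected, beta: float = 0.1):
    """All inputs are per-token log-prob tensors of shape (seq_len,)."""
    margin = (w_chosen * (logp_chosen - ref_chosen)).sum() \
           - (w_rejected * (logp_rejected - ref_rejected)).sum()
    return -F.logsigmoid(beta * margin)

# Toy usage: uniform weights recover standard sequence-level DPO.
seq = torch.full((5,), -1.0)
loss = weighted_dpo_loss(seq, seq - 0.2, seq, seq, torch.ones(5), torch.ones(5))
print(loss)  # ~0.644
```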

Authors:Guodong Du, Zitao Fang, Jing Li, Junlin Li, Runhua Jiang, Shuyang Yu, Yifei Guo, Yangneng Chen, Sim Kuan Goh, Ho-Kin Tang, Daojing He, Honghai Liu, Min Zhang
Title: Neural Parameter Search for Slimmer Fine-Tuned Models and Better Transfer
Abstract:
Foundation models and their checkpoints have significantly advanced deep learning, boosting performance across various applications. However, fine-tuned models often struggle outside their specific domains and exhibit considerable redundancy. Recent studies suggest that combining a pruned fine-tuned model with the original pre-trained model can mitigate forgetting, reduce interference when merging model parameters across tasks, and improve compression efficiency. In this context, developing an effective pruning strategy for fine-tuned models is crucial. Leveraging the advantages of the task vector mechanism, we preprocess fine-tuned models by calculating the differences between them and the original model. Recognizing that different task vector subspaces contribute variably to model performance, we introduce a novel method called Neural Parameter Search (NPS-Pruning) for slimming down fine-tuned models. This method enhances pruning efficiency by searching through neural parameters of task vectors within low-rank subspaces. Our method has three key applications: enhancing knowledge transfer through pairwise model interpolation, facilitating effective knowledge fusion via model merging, and enabling the deployment of compressed models that retain near-original performance while significantly reducing storage costs. Extensive experiments across vision, NLP, and multi-modal benchmarks demonstrate the effectiveness and robustness of our approach, resulting in substantial performance gains. The code is publicly available at: https://github.com/duguodong7/NPS-Pruning.
中文摘要:该研究提出的神经参数搜索(NPS-Pruning)方法通过利用任务向量子空间有效压缩微调模型,在保持视觉、自然语言处理和多模态任务性能的同时,显著提升了知识迁移、模型融合和存储效率。
English Summary: The proposed Neural Parameter Search (NPS-Pruning) method effectively compresses fine-tuned models by leveraging task vector subspaces, enhancing knowledge transfer, model merging, and storage efficiency while maintaining performance across vision, NLP, and multi-modal tasks.
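
A minimal sketch of magnitude-based task-vector pruning, the preprocessing idea the abstract describes; the low-rank subspace search that gives NPS-Pruning its name is not modeled here:

```python
import torch

def prune_task_vector(base: torch.Tensor, finetuned: torch.Tensor,
                      keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep only the largest-magnitude entries of (finetuned - base)."""
    delta = finetuned - base                  # the task vector
    k = max(1, int(delta.numel() * keep_ratio))
    thresh = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    sparse_delta = torch.where(delta.abs() >= thresh, delta, torch.zeros_like(delta))
    return base + sparse_delta

base, ft = torch.randn(256, 256), torch.randn(256, 256)
slim = prune_task_vector(base, ft, keep_ratio=0.1)
print((slim != base).float().mean())  # ~0.1 of entries retain the fine-tuned delta
```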

Authors:Xu Zhang, Kun Zhang, Wenxin Ma, Rongsheng Wang, Chenxu Wu, Yingtai Li, S. Kevin Zhou
Title: A General Knowledge Injection Framework for ICD Coding
Abstract:
ICD Coding aims to assign a wide range of medical codes to a medical text document, which is a popular and challenging task in the healthcare domain. To alleviate the problems of long-tail distribution and the lack of annotations of code-specific evidence, many previous works have proposed incorporating code knowledge to improve coding performance. However, existing methods often focus on a single type of knowledge and design specialized modules that are complex and incompatible with each other, thereby limiting their scalability and effectiveness. To address this issue, we propose GKI-ICD, a novel, general knowledge injection framework that integrates three key types of knowledge, namely ICD Description, ICD Synonym, and ICD Hierarchy, without specialized design of additional modules. The comprehensive utilization of the above knowledge, which exhibits both differences and complementarity, can effectively enhance the ICD coding performance. Extensive experiments on existing popular ICD coding benchmarks demonstrate the effectiveness of GKI-ICD, which achieves the state-of-the-art performance on most evaluation metrics. Code is available at https://github.com/xuzhang0112/GKI-ICD.
中文摘要:GKI-ICD框架无需专门模块即可整合三种ICD知识类型,通过全面实验在多数评估指标上实现了最优性能。
English Summary: The GKI-ICD framework effectively integrates three types of ICD knowledge without specialized modules, achieving state-of-the-art performance on most metrics through comprehensive experiments.

Authors:Zixiang Xu, Yanbo Wang, Yue Huang, Xiuying Chen, Jieyu Zhao, Meng Jiang, Xiangliang Zhang
Title: Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models
Abstract:
Large Language Models (LLMs) have achieved remarkable success in Natural Language Processing (NLP), yet their cross-lingual performance consistency remains a significant challenge. This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in LLMs. Our approach leverages beam search and LLM-based simulation to generate bilingual question pairs that expose performance discrepancies between English and target languages. We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. The extensive experiments demonstrate that our method precisely and cost-effectively pinpoints cross-lingual weaknesses, consistently revealing over 50\% accuracy drops in target languages across a wide range of models. Moreover, further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns and benefit from targeted post-training. Code is available at https://github.com/xzx34/Cross-Lingual-Pitfalls.
中文: 本文提出了一种利用集束搜索和基于大语言模型的模拟的新方法,有效识别大语言模型的跨语言弱点,揭示了目标语言中准确率显著下降的现象,并证明语言亲缘关系越近的语种表现模式越相似。
English: This paper introduces a novel methodology using beam search and LLM-based simulation to efficiently identify cross-lingual weaknesses in Large Language Models, revealing significant accuracy drops in target languages and demonstrating that linguistically related languages share similar performance patterns.

Authors:Md. Tanzib Hosain, Rajan Das Gupta, Md. Kishor Morol
Title: Multilingual Question Answering in Low-Resource Settings: A Dzongkha-English Benchmark for Foundation Models
Abstract:
In this work, we provide DZEN, a dataset of parallel Dzongkha and English test questions for Bhutanese middle and high school students. The over 5K questions in our collection span a variety of scientific topics and include factual, application, and reasoning-based questions. We use our parallel dataset to test a number of Large Language Models (LLMs) and find a significant performance difference between the models in English and Dzongkha. We also look at different prompting strategies and discover that Chain-of-Thought (CoT) prompting works well for reasoning questions but less well for factual ones. We also find that adding English translations enhances the precision of Dzongkha question responses. Our results point to exciting avenues for further study to improve LLM performance in Dzongkha and, more generally, in low-resource languages. We release the dataset at: https://github.com/kraritt/llm_dzongkha_evaluation.
中文: 本研究发布了DZEN数据集,包含5000多道不丹中学生使用的宗喀语与英语平行试题,发现大型语言模型在两种语言间存在显著性能差异,并证明思维链提示和英语翻译能有效提升宗喀语问题的回答准确率。
English: This study introduces DZEN, a parallel dataset of over 5,000 Dzongkha and English questions for Bhutanese students, revealing significant performance gaps in LLMs between the two languages and showing that Chain-of-Thought prompting and English translations improve Dzongkha question accuracy.

Authors:Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu
Title: MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation
Abstract:
Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce the Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose the Syllable-Constrained Audio-Video LLM with Chain-of-Thought (SylAVL-CoT), which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.
中文: MAVL基准和SylAVL-CoT模型通过整合多模态线索与音节约束,显著提升了可唱歌词翻译的自然度与准确性,全面优于纯文本翻译方法。
English: The MAVL benchmark and SylAVL-CoT model enhance singable lyrics translation by integrating multimodal cues and syllabic constraints, significantly outperforming text-only methods in both naturalness and accuracy.

Authors:Tengxuan Liu, Shiyao Li, Jiayi Yang, Tianchen Zhao, Feng Zhou, Xiaohui Song, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
Title: PM-KVQ: Progressive Mixed-precision KV Cache Quantization for Long-CoT LLMs
Abstract:
Recently, significant progress has been made in developing reasoning-capable Large Language Models (LLMs) through long Chain-of-Thought (CoT) techniques. However, this long-CoT reasoning process imposes substantial memory overhead due to the large Key-Value (KV) Cache. Post-training KV Cache quantization has emerged as a promising compression technique and has been extensively studied in short-context scenarios. However, directly applying existing methods to long-CoT LLMs causes significant performance degradation due to the following two reasons: (1) Large cumulative error: Existing methods fail to adequately leverage available memory, and they directly quantize the KV Cache during each decoding step, leading to large cumulative quantization error. (2) Short-context calibration: Due to Rotary Positional Embedding (RoPE), the use of short-context data during calibration fails to account for the distribution of less frequent channels in the Key Cache, resulting in performance loss. We propose Progressive Mixed-Precision KV Cache Quantization (PM-KVQ) for long-CoT LLMs to address the above issues on two fronts: (1) To reduce cumulative error, we design a progressive quantization strategy to gradually lower the bit-width of the KV Cache in each block. Then, we propose block-wise memory allocation to assign a higher bit-width to more sensitive transformer blocks. (2) To increase the calibration length without additional overhead, we propose a new calibration strategy that leverages short calibration data with positional interpolation to approximate the data distribution of long-context data. Extensive experiments on 7B-70B long-CoT LLMs show that PM-KVQ improves reasoning benchmark performance by up to 8% over SOTA baselines under the same memory budget. Our code is available at https://github.com/thu-nics/PM-KVQ.
中文: 针对长思维链大语言模型中KV缓存导致的高内存开销问题,现有量化方法因累积误差和短上下文校准而性能下降;提出的PM-KVQ通过渐进混合精度量化和位置插值校准策略,在相同内存预算下将推理基准性能提升最高达8%。
English: Recent advancements in long Chain-of-Thought reasoning for Large Language Models face significant memory overhead from KV Cache, which existing quantization methods degrade due to cumulative errors and short-context calibration; the proposed PM-KVQ addresses these with progressive mixed-precision quantization and positional interpolation, improving benchmark performance by up to 8%.
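
A minimal sketch of a progressive, block-wise bit-width schedule in the spirit of PM-KVQ; the decoding-progress breakpoints and the 4-bit floor for sensitive blocks are illustrative assumptions, not the paper's allocation algorithm:

```python
def bitwidth_schedule(step: int, total_steps: int,
                      sensitive_blocks: set[int], n_blocks: int) -> list[int]:
    """Lower KV Cache precision as decoding proceeds; protect sensitive blocks."""
    frac = step / total_steps
    base_bits = 8 if frac < 0.25 else 4 if frac < 0.75 else 2
    return [max(base_bits, 4) if b in sensitive_blocks else base_bits
            for b in range(n_blocks)]

# Late in decoding, most blocks drop to 2 bits; blocks 0 and 31 stay at 4.
print(bitwidth_schedule(900, 1000, sensitive_blocks={0, 31}, n_blocks=32))
```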

Authors:Jongwoo Ko, Sungnyun Kim, Sungwoo Cho, Se-Young Yun
Title: Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators
Abstract:
Human-generated reward signals are critical for aligning generative models with human preferences, guiding both training and inference-time evaluations. While large language models (LLMs) employed as proxy evaluators, i.e., LLM-as-a-Judge, significantly reduce the costs associated with manual annotations, they typically require extensive modality-specific training data and fail to generalize well across diverse multimodal tasks. In this paper, we propose Flex-Judge, a reasoning-guided multimodal judge model that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats. Our core intuition is that structured textual reasoning explanations inherently encode generalizable decision-making patterns, enabling an effective transfer to multimodal judgments, e.g., with images or videos. Empirical results demonstrate that Flex-Judge, despite being trained on significantly less text data, achieves competitive or superior performance compared to state-of-the-art commercial APIs and extensively trained multimodal evaluators. Notably, Flex-Judge presents broad impact in modalities like molecules, where comprehensive evaluation benchmarks are scarce, underscoring its practical value in resource-constrained domains. Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable multimodal model-as-a-judge.
中文摘要:Flex-Judge是一种基于推理的多模态评估模型,通过少量文本推理数据即可泛化至多种模态和评估形式,以更少训练资源实现了与先进模型相媲美的性能。
English Summary: Flex-Judge is a reasoning-guided multimodal judge model that uses minimal textual reasoning data to generalize effectively across multiple modalities and evaluation formats, achieving competitive performance with fewer training resources.

Authors:Wentao Hu, Wengyu Zhang, Yiyang Jiang, Chen Jason Zhang, Xiaoyong Wei, Qing Li
Title: Removal of Hallucination on Hallucination: Debate-Augmented RAG
Abstract:
Retrieval-Augmented Generation (RAG) enhances factual accuracy by integrating external knowledge, yet it introduces a critical issue: erroneous or biased retrieval can mislead generation, compounding hallucinations, a phenomenon we term Hallucination on Hallucination. To address this, we propose Debate-Augmented RAG (DRAG), a training-free framework that integrates Multi-Agent Debate (MAD) mechanisms into both retrieval and generation stages. In retrieval, DRAG employs structured debates among proponents, opponents, and judges to refine retrieval quality and ensure factual reliability. In generation, DRAG introduces asymmetric information roles and adversarial debates, enhancing reasoning robustness and mitigating factual inconsistencies. Evaluations across multiple tasks demonstrate that DRAG improves retrieval reliability, reduces RAG-induced hallucinations, and significantly enhances overall factual accuracy. Our code is available at https://github.com/Huenao/Debate-Augmented-RAG.
Chinese: 摘要提出了一种无需训练的辩论增强检索生成框架(DRAG),通过在检索阶段引入结构化辩论和在生成阶段采用对抗性辩论,有效提升检索可靠性、减少幻觉现象,从而显著提高事实准确性。
English: The abstract introduces Debate-Augmented RAG (DRAG), a training-free framework that integrates multi-agent debate mechanisms to improve retrieval reliability and reduce hallucinations by employing structured debates during retrieval and adversarial debates in generation, thereby enhancing factual accuracy.

Authors:Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan
Title: Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
Abstract:
Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model's exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data are available at: https://github.com/LiaoMengqi/E3-RL4LLMs
中文: 本研究提出动态分配训练资源和自适应调整温度的策略,以优化大型语言模型的强化学习过程,实现高效训练并保持探索能力。
English: This study introduces a dynamic rollout allocation mechanism and adaptive temperature adjustment strategy to enhance reinforcement learning for large language models, enabling more efficient training and sustained exploratory capacity.
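
A minimal sketch of the two mechanisms described above, difficulty-scaled rollout budgets and entropy-tracking temperature control; both update rules are illustrative assumptions rather than the paper's exact formulas:

```python
def allocate_rollouts(difficulty: float, base: int = 4, max_rollouts: int = 32) -> int:
    """difficulty in [0, 1], e.g. 1 - recent pass rate on this question."""
    return min(max_rollouts, base + int(difficulty * (max_rollouts - base)))

def adjust_temperature(temp: float, entropy: float, target: float,
                       lr: float = 0.05) -> float:
    """Raise temperature when entropy collapses, lower it when too diffuse."""
    return max(0.1, temp + lr * (target - entropy))

# A hard question gets 29 rollouts; low entropy nudges the temperature up.
print(allocate_rollouts(0.9), adjust_temperature(1.0, entropy=0.8, target=1.2))
```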

Authors:Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, Xueqian Wang
Title: Reinforcement Fine-Tuning Powers Reasoning Capability of Multimodal Large Language Models
Abstract:
Standing in 2025, at a critical juncture in the pursuit of Artificial General Intelligence (AGI), reinforcement fine-tuning (RFT) has demonstrated significant potential in enhancing the reasoning capability of large language models (LLMs) and has led to the development of cutting-edge AI models such as OpenAI-o1 and DeepSeek-R1. Moreover, the efficient application of RFT to enhance the reasoning capability of multimodal large language models (MLLMs) has attracted widespread attention from the community. In this position paper, we argue that reinforcement fine-tuning powers the reasoning capability of multimodal large language models. To begin with, we provide a detailed introduction to the fundamental background knowledge that researchers interested in this field should be familiar with. Furthermore, we meticulously summarize the improvements of RFT in powering reasoning capability of MLLMs into five key points: diverse modalities, diverse tasks and domains, better training algorithms, abundant benchmarks and thriving engineering frameworks. Finally, we propose five promising directions for future research that the community might consider. We hope that this position paper will provide valuable insights to the community at this pivotal stage in the advancement toward AGI. Summary of works done on RFT for MLLMs is available at https://github.com/Sun-Haoyuan23/Awesome-RL-based-Reasoning-MLLMs.
中文摘要:强化微调技术显著提升了多模态大语言模型的推理能力,本文通过五大改进方向与未来研究路径,为通用人工智能发展的关键阶段提供了重要见解。
English Summary: Reinforcement fine-tuning significantly enhances the reasoning capabilities of multimodal large language models, as detailed through five key improvements and future research directions in this pivotal AGI development stage.

Authors:Guodong Du, Xuanning Zhou, Junlin Li, Zhuo Li, Zesheng Shi, Wanyu Lin, Ho-Kin Tang, Xiucheng Li, Fangming Liu, Wenya Wang, Min Zhang, Jing Li
Title: Knowledge Grafting of Large Language Models
Abstract:
Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning. Recent works like FuseLLM and FuseChat have demonstrated the potential of transferring multiple model capabilities to lightweight models, enhancing adaptability and efficiency, which motivates our investigation into more efficient cross-capability transfer methods. However, existing approaches primarily focus on small, homogeneous models, limiting their applicability. For large, heterogeneous models, knowledge distillation with full-parameter fine-tuning often overlooks the student model's intrinsic capacity and risks catastrophic forgetting, while PEFT methods struggle to effectively absorb knowledge from source LLMs. To address these issues, we introduce GraftLLM, a novel method that stores source model capabilities in a target model with SkillPack format. This approach preserves general capabilities, reduces parameter conflicts, and supports forget-free continual learning and model fusion. We employ a module-aware adaptive compression strategy to compress parameter updates, ensuring efficient storage while maintaining task-specific knowledge. The resulting SkillPack serves as a compact and transferable knowledge carrier, ideal for heterogeneous model fusion and continual learning. Experiments across various scenarios demonstrate that GraftLLM outperforms existing techniques in knowledge transfer, knowledge fusion, and forget-free learning, providing a scalable and efficient solution for cross-capability transfer. The code is publicly available at: https://github.com/duguodong7/GraftLLM.
Chinese: GraftLLM提出了一种新颖的跨能力迁移方法,通过SkillPack格式在异构模型间高效存储和传递知识,既能防止灾难性遗忘,又能实现可扩展的持续学习。
English: GraftLLM introduces a novel cross-capability transfer method using SkillPack format to efficiently store and transfer knowledge between heterogeneous models while preventing catastrophic forgetting and enabling scalable continual learning.

Authors:Xuanhe Zhou, Junxuan He, Wei Zhou, Haodong Chen, Zirui Tang, Haoyu Zhao, Xin Tong, Guoliang Li, Youmin Chen, Jun Zhou, Zhaojun Sun, Binyuan Hui, Shuo Wang, Conghui He, Zhiyuan Liu, Jingren Zhou, Fan Wu
Title: A Survey of LLM $\times$ DATA
Abstract:
The integration of large language model (LLM) and data management (DATA) is rapidly redefining both domains. In this survey, we comprehensively review the bidirectional relationships. On the one hand, DATA4LLM, spanning large-scale data processing, storage, and serving, feeds LLMs with high quality, diversity, and timeliness of data required for stages like pre-training, post-training, retrieval-augmented generation, and agentic workflows: (i) Data processing for LLMs includes scalable acquisition, deduplication, filtering, selection, domain mixing, and synthetic augmentation; (ii) Data Storage for LLMs focuses on efficient data and model formats, distributed and heterogeneous storage hierarchies, KV-cache management, and fault-tolerant checkpointing; (iii) Data serving for LLMs tackles challenges in RAG (e.g., knowledge post-processing), LLM inference (e.g., prompt compression, data provenance), and training strategies (e.g., data packing and shuffling). On the other hand, in LLM4DATA, LLMs are emerging as general-purpose engines for data management. We review recent advances in (i) data manipulation, including automatic data cleaning, integration, discovery; (ii) data analysis, covering reasoning over structured, semi-structured, and unstructured data, and (iii) system optimization (e.g., configuration tuning, query rewriting, anomaly diagnosis), powered by LLM techniques like retrieval-augmented prompting, task-specialized fine-tuning, and multi-agent collaboration.
中文: 本综述探讨了大型语言模型与数据管理的双向融合,既涵盖数据系统通过处理、存储和服务支持模型开发,也涉及模型如何优化数据操作、分析和系统管理。
English: This survey explores the bidirectional integration between large language models (LLMs) and data management, covering how data systems support LLM development through processing, storage, and serving, while LLMs enhance data tasks like manipulation, analysis, and system optimization.

Authors:Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong
Title: DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding
Abstract:
We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
中文: DanmakuTPPBench推出了一个多模态基准,整合了来自B站弹幕的时间事件数据和问答数据集,以推动时序点过程建模的发展,揭示了现有方法在处理时序-文本-视觉推理方面的不足。
English: DanmakuTPPBench introduces a multi-modal benchmark combining temporal event data from Bilibili's bullet comments with a QA dataset to advance Temporal Point Process modeling, revealing current methods' limitations in handling temporal-textual-visual reasoning.

Authors:Abdellah El Mekki, Houdaifa Atou, Omer Nacar, Shady Shehata, Muhammad Abdul-Mageed
Title: NileChat: Towards Linguistically Diverse and Culturally Aware LLMs for Local Communities
Abstract:
Enhancing the linguistic capabilities of Large Language Models (LLMs) to include low-resource languages is a critical research area. Current research directions predominantly rely on synthetic data generated by translating English corpora, which, while demonstrating promising linguistic understanding and translation abilities, often results in models aligned with source language culture. These models frequently fail to represent the cultural heritage and values of local communities. This work proposes a methodology to create both synthetic and retrieval-based pre-training data tailored to a specific community, considering its (i) language, (ii) cultural heritage, and (iii) cultural values. We demonstrate our methodology using Egyptian and Moroccan dialects as testbeds, chosen for their linguistic and cultural richness and current underrepresentation in LLMs. As a proof-of-concept, we develop NileChat, a 3B parameter Egyptian and Moroccan Arabic LLM adapted for Egyptian and Moroccan communities, incorporating their language, cultural heritage, and values. Our results on various understanding, translation, and cultural and values alignment benchmarks show that NileChat outperforms existing Arabic-aware LLMs of similar size and performs on par with larger models. This work addresses Arabic dialects in LLMs with a focus on cultural and values alignment via controlled synthetic data generation and retrieval-augmented pre-training for Moroccan Darija and Egyptian Arabic, including Arabizi variants, advancing Arabic NLP for low-resource communities. We share our methods, data, and models with the community to promote the inclusion and coverage of more diverse communities in cultural LLM development: https://github.com/UBC-NLP/nilechat.
Chinese: 本研究提出了一种方法,通过生成文化和语言上定制化的数据来增强大语言模型对低资源语言的支持,并以尼罗河聊天模型为例,展示了其在埃及和摩洛哥方言的理解、翻译及文化对齐方面的卓越表现。
English: This research introduces a methodology to enhance LLMs for low-resource languages by generating culturally and linguistically tailored data, demonstrated through NileChat, a model that excels in understanding, translation, and cultural alignment for Egyptian and Moroccan dialects.

Authors:Zifu Wan, Yaqi Xie, Ce Zhang, Zhiqiu Lin, Zihan Wang, Simon Stepputtis, Deva Ramanan, Katia Sycara
Title: InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning
Abstract:
Large multimodal foundation models, particularly in the domains of language and vision, have significantly advanced various tasks, including robotics, autonomous driving, information retrieval, and grounding. However, many of these models perceive objects as indivisible, overlooking the components that constitute them. Understanding these components and their associated affordances provides valuable insights into an object's functionality, which is fundamental for performing a wide range of tasks. In this work, we introduce a novel real-world benchmark, InstructPart, comprising hand-labeled part segmentation annotations and task-oriented instructions to evaluate the performance of current models in understanding and executing part-level tasks within everyday contexts. Through our experiments, we demonstrate that task-oriented part segmentation remains a challenging problem, even for state-of-the-art Vision-Language Models (VLMs). In addition to our benchmark, we introduce a simple baseline that achieves a twofold performance improvement through fine-tuning with our dataset. With our dataset and benchmark, we aim to facilitate research on task-oriented part segmentation and enhance the applicability of VLMs across various domains, including robotics, virtual reality, information retrieval, and other related fields. Project website: https://zifuwan.github.io/InstructPart/.
中文摘要:本文提出InstructPart基准,用于评估视觉语言模型在部件级物体理解方面的能力,揭示了现有模型的不足,并通过微调方法实现了性能翻倍的基线改进。
English Summary: This paper introduces InstructPart, a benchmark for evaluating part-level object understanding in vision-language models, revealing current models' limitations and proposing a fine-tuned baseline that doubles performance.

Authors:Jianghao Wu, Feilong Tang, Yulong Li, Ming Hu, Haochen Xue, Shoaib Jameel, Yutong Xie, Imran Razzak
Title: TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification
Abstract:
Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist-specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates. The code will be available at https://github.com/JianghaoWu/TAGS.
中文: TAGS框架通过结合通用与专业模型及辅助模块,无需微调即可显著提升多个医学问答基准的准确率,实现卓越的医疗推理性能。
English: TAGS is a test-time framework that combines generalist and specialist models with auxiliary modules to enhance medical reasoning, achieving significant accuracy improvements across multiple benchmarks without fine-tuning.
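
A minimal sketch of reliability-guided answer aggregation as the abstract describes it; `reliability` is a hypothetical stand-in for the paper's consistency scorer, and the chains are assumed to come from both the generalist and the specialist:

```python
from collections import defaultdict
from typing import Callable

def aggregate_answers(chains: list[tuple[str, str]],
                      reliability: Callable[[str], float]) -> str:
    """chains: (reasoning_chain, final_answer) pairs from both agents."""
    votes: dict[str, float] = defaultdict(float)
    for chain, answer in chains:
        votes[answer] += reliability(chain)  # reliability-weighted voting
    return max(votes, key=votes.get)

# Toy usage with a constant-reliability stub.
chains = [("chain a", "B"), ("chain b", "B"), ("chain c", "D")]
print(aggregate_answers(chains, reliability=lambda c: 1.0))  # "B"
```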

Authors:Wafa Alghallabi, Ritesh Thawkar, Sara Ghaboura, Ketan More, Omkar Thawakar, Hisham Cholakkal, Salman Khan, Rao Muhammad Anwer
Title: Fann or Flop: A Multigenre, Multiera Benchmark for Arabic Poetry Understanding in LLMs
Abstract:
Arabic poetry is one of the richest and most culturally rooted forms of expression in the Arabic language, known for its layered meanings, stylistic diversity, and deep historical continuity. Although large language models (LLMs) have demonstrated strong performance across languages and tasks, their ability to understand Arabic poetry remains largely unexplored. In this work, we introduce \emph{Fann or Flop}, the first benchmark designed to assess the comprehension of Arabic poetry by LLMs across 12 historical eras, covering 14 core poetic genres and a variety of metrical forms, from classical structures to contemporary free verse. The benchmark comprises a curated corpus of poems with explanations that assess semantic understanding, metaphor interpretation, prosodic awareness, and cultural context. We argue that poetic comprehension offers a strong indicator of how well an LLM understands classical Arabic. Unlike surface-level tasks, this domain demands deeper interpretive reasoning and cultural sensitivity. Our evaluation of state-of-the-art LLMs shows that most models struggle with poetic understanding despite strong results on standard Arabic benchmarks. We release "Fann or Flop" along with the evaluation suite as an open-source resource to enable rigorous evaluation and advancement for Arabic language models. Code is available at: https://github.com/mbzuai-oryx/FannOrFlop.
中文摘要:本研究推出了首个评估大语言模型对阿拉伯诗歌理解的基准“Fann or Flop”,涵盖多个历史时期和诗歌体裁,发现尽管模型在标准阿拉伯语任务中表现优异,但在深层诠释和文化理解方面仍存在困难。
English Summary: The study introduces "Fann or Flop," the first benchmark to evaluate large language models' comprehension of Arabic poetry across historical eras and genres, revealing their struggles with deeper interpretive and cultural understanding despite strong performance on standard tasks.

Authors:Kazem Faghih, Wenxiao Wang, Yize Cheng, Siddhant Bharti, Gaurang Sriramanan, Sriram Balasubramanian, Parsa Hosseini, Soheil Feizi
Title: Tool Preferences in Agentic LLMs are Unreliable
Abstract:
Large language models (LLMs) can now access a wide range of external tools, thanks to the Model Context Protocol (MCP). This greatly expands their capabilities as agents. However, LLMs rely entirely on the text descriptions of tools to decide which ones to use--a process that is surprisingly fragile. In this work, we expose a vulnerability in prevalent tool/function-calling protocols by investigating a series of edits to tool descriptions, some of which can drastically increase a tool's usage from LLMs when competing with alternatives. Through controlled experiments, we show that tools with properly edited descriptions receive over 10 times more usage from GPT-4.1 and Qwen2.5-7B than tools with original descriptions. We further evaluate how various edits to tool descriptions perform when competing directly with one another and how these trends generalize or differ across a broader set of 17 different models. These phenomena, while giving developers a powerful way to promote their tools, underscore the need for a more reliable foundation for agentic LLMs to select and utilize tools and resources. Our code is publicly available at https://github.com/kazemf78/llm-unreliable-tool-preferences.
中文: 大型语言模型选择工具时易受描述篡改的影响,某些编辑后的描述可使工具使用率激增十倍以上,这凸显了建立更可靠协议的必要性。
English: Large language models' tool selection is vulnerable to manipulated descriptions, with edited versions increasing usage over tenfold in some cases, highlighting the need for more reliable protocols.

Authors:Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li, Qibing Ren, Lizhuang Ma, Yuchao Dai, Pengfei Liu, Junjie Yan
Title: One RL to See Them All: Visual Triple Unified Reinforcement Learning
Abstract:
Reinforcement learning (RL) has significantly advanced the reasoning capabilities of vision-language models (VLMs). However, the use of RL beyond reasoning tasks remains largely unexplored, especially for perception-intensive tasks like object detection and grounding. We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables VLMs to jointly learn visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components: Sample-Level Data Formatting (to unify diverse task inputs), Verifier-Level Reward Computation (to deliver custom rewards via specialized verifiers), and Source-Level Metric Monitoring (to diagnose problems at the data-source level). We further introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune. Our approach is instantiated within an off-the-shelf RL training framework using open-source 7B and 32B backbone models. The resulting model, dubbed Orsta (One RL to See Them All), demonstrates consistent improvements across both reasoning and perception tasks. This broad capability is significantly shaped by its training on a diverse dataset, constructed around four representative visual reasoning tasks (Math, Puzzle, Chart, and Science) and four visual perception tasks (Grounding, Detection, Counting, and OCR). Subsequently, Orsta achieves substantial gains on MEGA-Bench Core, with improvements ranging from +2.1 to an impressive +14.1 across its various 7B and 32B model variants, with performance benefits extending to a wide range of downstream tasks. These results highlight the effectiveness and scalability of our unified RL approach for VLMs. The V-Triune system, along with the Orsta models, is publicly available at https://github.com/MiniMax-AI.
中文: V-Triune提出了一种统一强化学习系统,通过样本级数据格式化、验证器级奖励计算和源级指标监控三大组件,使视觉语言模型能够同时掌握推理与感知任务,并在各类基准测试中实现显著性能提升。
English: V-Triune introduces a unified reinforcement learning system that enables vision-language models to jointly master both reasoning and perception tasks, achieving significant performance gains across diverse benchmarks through its triple-component architecture and novel Dynamic IoU reward mechanism.
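
A minimal sketch of an adaptive box reward in the spirit of V-Triune's Dynamic IoU reward: early in training any reasonable overlap scores, while later only precise boxes do. The linear threshold schedule from 0.5 to 0.95 is an illustrative assumption:

```python
def iou(a: tuple, b: tuple) -> float:
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def dynamic_iou_reward(pred: tuple, gt: tuple, progress: float) -> float:
    """progress in [0, 1]; the reward threshold tightens as training advances."""
    thresh = 0.5 + 0.45 * progress
    v = iou(pred, gt)
    return v if v >= thresh else 0.0

print(dynamic_iou_reward((0, 0, 10, 10), (1, 1, 10, 10), progress=0.0))  # 0.81
```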

Authors:Lisheng Huang, Yichen Liu, Jinhao Jiang, Rongxiang Zhang, Jiahao Yan, Junyi Li, Wayne Xin Zhao
Title: ManuSearch: Democratizing Deep Search in Large Language Models with a Transparent and Open Multi-Agent Framework
Abstract:
Recent advances in web-augmented large language models (LLMs) have exhibited strong performance in complex reasoning tasks, yet these capabilities are mostly locked in proprietary systems with opaque architectures. In this work, we propose \textbf{ManuSearch}, a transparent and modular multi-agent framework designed to democratize deep search for LLMs. ManuSearch decomposes the search and reasoning process into three collaborative agents: (1) a solution planning agent that iteratively formulates sub-queries, (2) an Internet search agent that retrieves relevant documents via real-time web search, and (3) a structured webpage reading agent that extracts key evidence from raw web content. To rigorously evaluate deep reasoning abilities, we introduce \textbf{ORION}, a challenging benchmark focused on open-web reasoning over long-tail entities, covering both English and Chinese. Experimental results show that ManuSearch substantially outperforms prior open-source baselines and even surpasses leading closed-source systems. Our work paves the way for reproducible, extensible research in open deep search systems. We release the data and code in https://github.com/RUCAIBox/ManuSearch
中文:ManuSearch是一个透明的多智能体框架,通过将深度搜索分解为规划、网络搜索和内容提取三个协作代理,使大型语言模型的深度搜索能力民主化,并在ORION基准测试中显著超越现有系统。
English: ManuSearch is a transparent, multi-agent framework that democratizes deep search for large language models by decomposing the process into planning, web search, and content extraction agents, significantly outperforming existing systems on the new ORION benchmark.

Authors:Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu
Title: Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding
Abstract:
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that manually design a rigid workflow, our approach emphasizes the autonomous nature of agents. By providing a set of search-centric tools on a multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLMs to plan on its current observation state, strategically select tools, formulate appropriate parameters for actions, and iteratively refine its internal reasoning in light of the gathered information. We perform comprehensive evaluations on multiple long video understanding benchmarks, demonstrating the advantage of the entire system design. Our DVD agent achieves SOTA performance, significantly surpassing prior works by a large margin on the challenging LVBench dataset. Comprehensive ablation studies and in-depth tool analyses are also provided, yielding insights to further advance intelligent agents tailored for long-form video understanding tasks. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.
Chinese: Deep Video Discovery 代理通过采用基于工具的自主搜索策略处理分段视频,克服了大型语言模型在处理长视频时的限制,在 LVBench 等基准测试中取得了领先性能。
English: The Deep Video Discovery agent overcomes LLM limitations in processing long videos by employing an autonomous, tool-based search strategy across segmented clips, achieving state-of-the-art results on benchmarks like LVBench.
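
The autonomous tool-use loop the abstract describes can be sketched as follows; the JSON action protocol, tool names, and step budget are assumptions for exposition rather than the released DeepVideoDiscovery API.

```python
# Illustrative agentic search loop in the spirit of the DVD agent.
import json
from typing import Callable

def dvd_agent(question: str, llm: Callable[[str], str],
              tools: dict[str, Callable[[str], str]], max_steps: int = 8) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        decision = llm(
            f"Question: {question}\nObservations: {observations}\n"
            f"Available tools: {list(tools)}\n"
            'Reply with JSON {"tool": name, "args": string} or {"answer": string}.')
        action = json.loads(decision)
        if "answer" in action:
            return action["answer"]
        # The agent strategically selects a tool and formulates its parameters,
        # then folds the result back into its observation state.
        result = tools[action["tool"]](action["args"])
        observations.append(f'{action["tool"]}: {result}')
    return llm(f"Question: {question}\nObservations: {observations}\nBest-effort answer:")
```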

Authors:Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou
Title: Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models
Abstract:
Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.
中文: Trinity-RFT 是一个通用、统一且易用的强化微调框架,采用模块化设计,整合了多种RFT模式,高效处理智能体与环境交互,并提供优化的数据管道,适用于广泛的应用场景和研究开发。
English: Trinity-RFT is a versatile and user-friendly framework for reinforcement fine-tuning of large language models, featuring a modular design that unifies various RFT modes, integrates agent-environment interactions efficiently, and provides optimized data pipelines for diverse applications and research.

Authors:Tazeek Bin Abdur Rakib, Ambuj Mehrish, Lay-Ki Soon, Wern Han Lim, Soujanya Poria
Title: DialogXpert: Driving Intelligent and Emotion-Aware Conversations through Online Value-Based Reinforcement Learning with LLM Priors
Abstract:
Large-language-model (LLM) agents excel at reactive dialogue but struggle with proactive, goal-driven interactions due to myopic decoding and costly planning. We introduce DialogXpert, which leverages a frozen LLM to propose a small, high-quality set of candidate actions per turn and employs a compact Q-network over fixed BERT embeddings trained via temporal-difference learning to select optimal moves within this reduced space. By tracking the user's emotions, DialogXpert tailors each decision to advance the task while nurturing a genuine, empathetic connection. Across negotiation, emotional support, and tutoring benchmarks, DialogXpert drives conversations to under $3$ turns with success rates exceeding 94\% and, with a larger LLM prior, pushes success above 97\% while markedly improving negotiation outcomes. This framework delivers real-time, strategic, and emotionally intelligent dialogue planning at scale. Code available at https://github.com/declare-lab/dialogxpert/
Chinese: DialogXpert通过利用冻结大语言模型生成候选行动,并采用紧凑Q网络选择最优决策,显著提升了LLM代理在目标驱动对话中的表现,在不到三轮对话中实现超过94%的成功率,同时融入情感智能以建立共情连接。
English: DialogXpert enhances LLM agents for proactive, goal-driven dialogues by using a frozen LLM to generate candidate actions and a compact Q-network to select optimal moves, achieving over 94% success rates in under three turns while incorporating emotional intelligence for empathetic interactions.
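
A minimal sketch of the value-learning component, assuming frozen-LLM candidate actions are embedded with a fixed BERT encoder; the layer sizes, reward signal, and one-step TD(0) update below are illustrative placeholders, not the paper's exact training recipe.

```python
# Compact Q-network over fixed BERT embeddings, trained by temporal difference.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, state_emb: torch.Tensor, action_emb: torch.Tensor) -> torch.Tensor:
        # Score each (dialogue state, candidate action) pair.
        return self.net(torch.cat([state_emb, action_emb], dim=-1)).squeeze(-1)

q = QNetwork()
opt = torch.optim.Adam(q.parameters(), lr=1e-4)
gamma = 0.99

def td_update(s, actions, idx, reward, s_next, next_actions):
    """One TD(0) step. s, s_next: (768,) state embeddings; actions, next_actions:
    (k, 768) fixed BERT embeddings of frozen-LLM candidate actions; idx: chosen action."""
    q_sa = q(s.expand_as(actions), actions)[idx]
    with torch.no_grad():
        target = reward + gamma * q(s_next.expand_as(next_actions), next_actions).max()
    loss = (q_sa - target).pow(2)
    opt.zero_grad(); loss.backward(); opt.step()
```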

Authors:Ziyu Ge, Yuhao Wu, Daniel Wai Kit Chin, Roy Ka-Wei Lee, Rui Cao
Title: Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs
Abstract:
Large Language Models (LLMs) augmented with retrieval mechanisms have demonstrated significant potential in fact-checking tasks by integrating external knowledge. However, their reliability decreases when confronted with conflicting evidence from sources of varying credibility. This paper presents the first systematic evaluation of Retrieval-Augmented Generation (RAG) models for fact-checking in the presence of conflicting evidence. To support this study, we introduce \textbf{CONFACT} (\textbf{Con}flicting Evidence for \textbf{Fact}-Checking) (Dataset available at https://github.com/zoeyyes/CONFACT), a novel dataset comprising questions paired with conflicting information from various sources. Extensive experiments reveal critical vulnerabilities in state-of-the-art RAG methods, particularly in resolving conflicts stemming from differences in media source credibility. To address these challenges, we investigate strategies to integrate media background information into both the retrieval and generation stages. Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.
中文摘要:检索增强语言模型在事实核查任务中潜力显著但处理冲突证据时可靠性下降,为此我们开发了CONFACT数据集并提出通过整合信息来源可信度来有效提升模型解决证据冲突能力的方法。
English Summary: Retrieval-augmented language models show promise for fact-checking but struggle with conflicting evidence, leading to the creation of the CONFACT dataset and methods that improve performance by incorporating source credibility information.
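
One of the strategies the paper studies, folding source credibility into the retrieval stage, can be illustrated with a simple reranker; the credibility priors, the linear mixing rule, and `alpha` are assumptions, not the paper's exact formulation.

```python
# Credibility-weighted reranking of retrieved evidence (illustrative).
def rerank_with_credibility(passages, relevance_scores, source_credibility, alpha=0.7):
    """passages: list of (text, source); relevance_scores: retriever scores in [0, 1];
    source_credibility: dict mapping source -> credibility prior in [0, 1]."""
    scored = []
    for (text, source), rel in zip(passages, relevance_scores):
        cred = source_credibility.get(source, 0.5)  # unknown outlets get a neutral prior
        scored.append((alpha * rel + (1 - alpha) * cred, text, source))
    return sorted(scored, reverse=True)

# Usage: a slightly less relevant passage from a credible outlet outranks a
# marginally more relevant one from a low-credibility source.
ranked = rerank_with_credibility(
    [("Claim X is false.", "fact-check outlet"), ("Claim X is true.", "anonymous blog")],
    relevance_scores=[0.80, 0.85],
    source_credibility={"fact-check outlet": 0.9, "anonymous blog": 0.2})
```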

Authors:Wei Huang, Yizhe Xiong, Xin Ye, Zhijie Deng, Hui Chen, Zijia Lin, Guiguang Ding
Title: Fast Quiet-STaR: Thinking Without Thought Tokens
Abstract:
Large Language Models (LLMs) have achieved impressive performance across a range of natural language processing tasks. However, recent advances demonstrate that further gains, particularly in complex reasoning tasks, require more than merely scaling up model sizes or training data. One promising direction is to enable models to think during the reasoning process. Recently, Quiet-STaR significantly improved reasoning by generating token-level thought traces, but it incurs substantial inference overhead. In this work, we propose Fast Quiet-STaR, a more efficient reasoning framework that preserves the benefits of token-level reasoning while reducing computational cost. Our method introduces a curriculum learning based training strategy that gradually reduces the number of thought tokens, enabling the model to internalize more abstract and concise reasoning processes. We further extend this approach to the standard Next Token Prediction (NTP) setting through reinforcement learning-based fine-tuning, resulting in Fast Quiet-STaR NTP, which eliminates the need for explicit thought token generation during inference. Experiments on four benchmark datasets with Mistral 7B and Qwen2.5 7B demonstrate that Fast Quiet-STaR consistently outperforms Quiet-STaR in terms of average accuracy under the same inference time budget. Notably, Fast Quiet-STaR NTP achieves an average accuracy improvement of 9\% on Mistral 7B and 5.7\% on Qwen2.5 7B, while maintaining the same inference latency. Our code will be available at https://github.com/huangwei200012/Fast-Quiet-STaR.
中文:Fast Quiet-STaR通过课程学习和强化学习优化推理过程,在保持相同推理速度的同时显著提升了多个模型的准确率。
English: Fast Quiet-STaR enhances reasoning efficiency by reducing computational overhead through curriculum learning and reinforcement learning, achieving higher accuracy without increasing inference time.

Authors:Joakim Edin, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe
Title: GIM: Improved Interpretability for Large Language Models
Abstract:
Ensuring faithful interpretability in large language models is imperative for trustworthy and reliable AI. A key obstacle is self-repair, a phenomenon where networks compensate for reduced signal in one component by amplifying others, masking the true importance of the ablated component. While prior work attributes self-repair to layer normalization and back-up components that compensate for ablated components, we identify a novel form occurring within the attention mechanism, where softmax redistribution conceals the influence of important attention scores. This leads traditional ablation and gradient-based methods to underestimate the significance of all components contributing to these attention scores. We introduce Gradient Interaction Modifications (GIM), a technique that accounts for self-repair during backpropagation. Extensive experiments across multiple large language models (Gemma 2B/9B, LLAMA 1B/3B/8B, Qwen 1.5B/3B) and diverse tasks demonstrate that GIM significantly improves faithfulness over existing circuit identification and feature attribution methods. Our work is a significant step toward better understanding the inner mechanisms of LLMs, which is crucial for improving them and ensuring their safety. Our code is available at https://github.com/JoakimEdin/gim.
中文: 该摘要介绍了梯度交互修正(GIM)这一新技术,它通过解决注意力机制中的自我修复现象,显著提升了大型语言模型可解释方法的可靠性,并在多种模型和任务中验证了其有效性。
English: The abstract introduces Gradient Interaction Modifications (GIM), a novel technique that addresses self-repair within attention mechanisms to enhance the faithfulness of interpretability methods in large language models, demonstrating significant improvements across various models and tasks.

Authors:Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang
Title: Distilling LLM Agent into Small Models with Retrieval and Code Tools
Abstract:
Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.
中文: 本文提出智能体蒸馏框架,通过改进的思维轨迹生成和自洽行动方法,将大型语言模型的完整推理与工具使用能力迁移至小型模型,使0.5B-3B参数的小模型能达到更大模型的性能水平。
English: This paper introduces Agent Distillation, a framework that transfers comprehensive reasoning and tool-using capabilities from large language models to smaller ones through enhanced trajectory generation and self-consistent action methods, enabling compact 0.5B-3B models to match larger counterparts' performance.

Authors:Linbao Li, Yannan Liu, Daojing He, Yu Li
Title: One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs
Abstract:
Safety alignment in large language models (LLMs) is increasingly compromised by jailbreak attacks, which can manipulate these models to generate harmful or unintended content. Investigating these attacks is crucial for uncovering model vulnerabilities. However, many existing jailbreak strategies fail to keep pace with the rapid development of defense mechanisms, such as defensive suffixes, rendering them ineffective against defended models. To tackle this issue, we introduce a novel attack method called ArrAttack, specifically designed to target defended LLMs. ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. This capability is supported by a universal robustness judgment model that, once trained, can perform robustness evaluation for any target model with a wide variety of defenses. By leveraging this model, we can rapidly develop a robust jailbreak prompt generator that efficiently converts malicious input prompts into effective attacks. Extensive evaluations reveal that ArrAttack significantly outperforms existing attack strategies, demonstrating strong transferability across both white-box and black-box models, including GPT-4 and Claude-3. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts. We make the codebase available at https://github.com/LLBao/ArrAttack.
Chinese: ArrAttack是一种新颖的越狱攻击方法,能自动生成绕过多种防御机制的鲁棒提示,在各类大语言模型上显著优于现有攻击策略并展现出强大的迁移能力。
English: ArrAttack is a novel jailbreak method that automatically generates robust prompts to bypass various defenses in large language models, significantly outperforming existing attacks and demonstrating strong transferability across models.

Authors:Jingjing Jiang, Chongjie Si, Jun Luo, Hanwang Zhang, Chao Ma
Title: Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
Abstract:
This paper presents a pioneering exploration of reinforcement learning (RL) via group relative policy optimization for unified multimodal large language models (ULMs), aimed at simultaneously reinforcing generation and understanding capabilities. Through systematic pilot studies, we uncover the significant potential of ULMs to enable the synergistic co-evolution of dual capabilities within a shared policy optimization framework. Building on this insight, we introduce CoRL, a co-reinforcement learning framework comprising a unified RL stage for joint optimization and a refined RL stage for task-specific enhancement. With the proposed CoRL, our resulting model, ULM-R1, achieves average improvements of 7% on three text-to-image generation datasets and 23% on nine multimodal understanding benchmarks. These results demonstrate the effectiveness of CoRL and highlight the substantial benefit of reinforcement learning in facilitating cross-task synergy and optimization for ULMs. Code is available at https://github.com/mm-vl/ULM-R1.
中文摘要:本文提出CoRL协同强化学习框架,通过统一策略优化同时增强多模态大语言模型的生成与理解能力,在多项基准测试中取得显著性能提升。
English Summary: This paper introduces CoRL, a co-reinforcement learning framework that enhances both generation and understanding in multimodal large language models, achieving significant performance gains across multiple benchmarks.

Authors:Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu, Andrew C Yao
Title: On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Abstract:
Policy gradient algorithms have been successfully applied to enhance the reasoning capabilities of large language models (LLMs). Despite the widespread use of Kullback-Leibler (KL) regularization in policy gradient algorithms to stabilize training, the systematic exploration of how different KL divergence formulations can be estimated and integrated into surrogate loss functions for online reinforcement learning (RL) presents a nuanced and systematically explorable design space. In this paper, we propose regularized policy gradient (RPG), a systematic framework for deriving and analyzing KL-regularized policy gradient methods in the online RL setting. We derive policy gradients and corresponding surrogate loss functions for objectives regularized by both forward and reverse KL divergences, considering both normalized and unnormalized policy distributions. Furthermore, we present derivations for fully differentiable loss functions as well as REINFORCE-style gradient estimators, accommodating diverse algorithmic needs. We conduct extensive experiments on RL for LLM reasoning using these methods, showing improved or competitive results in terms of training stability and performance compared to strong baselines such as GRPO, REINFORCE++, and DAPO. The code is available at https://github.com/complex-reasoning/RPG.
中文: 策略梯度算法提升了大型语言模型的推理能力;本研究提出的正则化策略梯度(RPG)框架系统地推导并分析了前向与反向KL正则化的策略梯度方法,在LLM推理的强化学习实验中相比GRPO、REINFORCE++和DAPO等强基线实现了更稳定的训练和相当或更优的性能。
English: Policy gradient algorithms enhance large language models' reasoning; this study introduces Regularized Policy Gradient (RPG), a systematic framework for deriving and analyzing forward- and reverse-KL-regularized policy gradient methods, improving training stability and performance over strong baselines such as GRPO, REINFORCE++, and DAPO.
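
The kind of surrogate loss the framework analyzes can be sketched compactly; the snippet below pairs a REINFORCE-style surrogate with the common k3 estimator of the reverse KL to a reference policy. The estimator choice and `beta` are illustrative, one point in the design space the paper maps.

```python
# Minimal KL-regularized policy-gradient surrogate (illustrative).
import torch

def rpg_surrogate(logp_cur, logp_old, logp_ref, advantages, beta=0.05):
    """All inputs are per-token log-probs of shape (batch,); advantages likewise."""
    ratio = torch.exp(logp_cur - logp_old)        # importance ratio w.r.t. sampling policy
    pg_loss = -(ratio * advantages).mean()        # policy-gradient surrogate
    # Reverse KL(pi_cur || pi_ref) via the k3 estimator E[r - 1 - log r],
    # with r = pi_ref / pi_cur evaluated on sampled tokens.
    r = torch.exp(logp_ref - logp_cur)
    kl = (r - 1.0 - torch.log(r)).mean()
    return pg_loss + beta * kl
```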

Authors:Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua
Title: L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models
Abstract:
Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to its inherently sequential process. To overcome these challenges, we propose leap multi-token prediction~(L-MTP), an innovative token prediction method that extends the capabilities of multi-token prediction (MTP) by introducing a leap-based mechanism. Unlike conventional MTP, which generates multiple tokens at adjacent positions, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass. This structured leap not only enhances the model's ability to capture long-range dependencies but also enables a decoding strategy specially optimized for non-sequential leap token generation, effectively accelerating inference. We theoretically demonstrate the benefit of L-MTP in improving inference efficiency. Experiments across diverse benchmarks validate its merit in boosting both LLM performance and inference speed. The source code is available at https://github.com/Xiaohao-Liu/L-MTP.
中文摘要:提出的跳跃多标记预测(L-MTP)方法通过单次前向传播预测非连续标记,有效增强了语言模型的远程依赖捕捉能力并加速了推理过程。
English Summary: The proposed leap multi-token prediction (L-MTP) method enhances language models by predicting non-sequential tokens in a single pass, improving both contextual understanding and inference speed.
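
The leap mechanism amounts to training prediction heads on strided offsets instead of adjacent positions. A minimal sketch, assuming one linear head per offset; the offsets (1, 3, 5) and the loss averaging are illustrative, not the paper's configuration.

```python
# Leap multi-token prediction loss over strided offsets (illustrative).
import torch
import torch.nn.functional as F

def lmtp_loss(hidden, heads, input_ids, offsets=(1, 3, 5)):
    """hidden: (batch, seq, dim); heads: one vocab-projection head per leap offset;
    input_ids: (batch, seq) token ids used as shifted labels."""
    losses = []
    for head, k in zip(heads, offsets):
        logits = head(hidden[:, :-k])             # predict the token at position t + k
        target = input_ids[:, k:]                 # labels shifted by that offset
        losses.append(F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                      target.reshape(-1)))
    return torch.stack(losses).mean()
```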

Authors:Minsoo Khang, Sangjun Park, Teakgyu Hong, Dawoon Jung
Title: CReSt: A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents
Abstract:
Large Language Models (LLMs) have made substantial progress in recent years, yet evaluating their capabilities in practical Retrieval-Augmented Generation (RAG) scenarios remains challenging. In practical applications, LLMs must demonstrate complex reasoning, refuse to answer appropriately, provide precise citations, and effectively understand document layout. These capabilities are crucial for advanced task handling, uncertainty awareness, maintaining reliability, and structural understanding. While some of the prior works address these aspects individually, there is a need for a unified framework that evaluates them collectively in practical RAG scenarios. To address this, we present CReSt (A Comprehensive Benchmark for Retrieval-Augmented Generation with Complex Reasoning over Structured Documents), a benchmark designed to assess these key dimensions holistically. CReSt comprises 2,245 human-annotated examples in English and Korean, designed to capture practical RAG scenarios that require complex reasoning over structured documents. It also introduces a tailored evaluation methodology to comprehensively assess model performance in these critical areas. Our evaluation shows that even advanced LLMs struggle to perform consistently across these dimensions, underscoring key areas for improvement. We release CReSt to support further research and the development of more robust RAG systems. The dataset and code are available at: https://github.com/UpstageAI/CReSt.
中文: 该摘要介绍了CReSt这一综合性基准,旨在整体评估大语言模型在检索增强生成场景中的复杂推理、拒答能力、引用准确性和文档布局理解等关键维度,研究表明即使先进模型在这些方面仍存在明显不足。
English: This abstract introduces CReSt, a comprehensive benchmark designed to holistically evaluate Large Language Models in Retrieval-Augmented Generation scenarios, focusing on complex reasoning, refusal capabilities, citation accuracy, and document layout understanding, revealing that even advanced models struggle across these dimensions.

Authors:Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang
Title: HydraRAG: Structured Cross-Source Enhanced Large Language Model Reasoning
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Current hybrid RAG systems retrieve evidence from both knowledge graphs (KGs) and text documents to support LLM reasoning. However, they face challenges like handling multi-hop reasoning, multi-entity questions, multi-source verification, and effective graph utilization. To address these limitations, we present HydraRAG, a training-free framework that unifies graph topology, document semantics, and source reliability to support deep, faithful reasoning in LLMs. HydraRAG handles multi-hop and multi-entity problems through agent-driven exploration that combines structured and unstructured retrieval, increasing both diversity and precision of evidence. To tackle multi-source verification, HydraRAG uses a tri-factor cross-source verification (source trustworthiness assessment, cross-source corroboration, and entity-path alignment) to balance topic relevance with cross-modal agreement. By leveraging graph structure, HydraRAG fuses heterogeneous sources, guides efficient exploration, and prunes noise early. Comprehensive experiments on seven benchmark datasets show that HydraRAG achieves overall state-of-the-art results on all benchmarks with GPT-3.5-Turbo, outperforming the strong hybrid baseline ToG-2 by an average of 20.3% and up to 30.1%. Furthermore, HydraRAG enables smaller models (e.g., Llama-3.1-8B) to achieve reasoning performance comparable to that of GPT-4-Turbo. The source code is available at https://stevetantan.github.io/HydraRAG/.
中文: HydraRAG是一个无需训练的新框架,通过统一图拓扑、文档语义和来源可靠性来解决多跳推理和多实体问题,在多个基准测试中均取得了最优性能。
English: HydraRAG is a training-free framework that enhances large language models by unifying graph topology, document semantics, and source reliability to solve multi-hop reasoning and multi-entity problems while achieving state-of-the-art results across benchmarks.
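
The tri-factor verification can be pictured as a weighted combination of three scores. The sketch below is a stand-in with assumed weights, a saturating corroboration count, and a pruning threshold; the paper's actual scoring is more involved.

```python
# Illustrative tri-factor cross-source verification score.
def verify_evidence(trust_score, corroborating_sources, path_alignment,
                    weights=(0.3, 0.4, 0.3), threshold=0.5):
    """trust_score (source trustworthiness) and path_alignment (entity-path
    alignment) in [0, 1]; corroboration saturates at 3 independent sources."""
    corroboration = min(len(corroborating_sources), 3) / 3.0
    score = (weights[0] * trust_score
             + weights[1] * corroboration
             + weights[2] * path_alignment)
    return score, score >= threshold  # keep the candidate, or prune it early
```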

Authors:Wei Jie Yeo, Rui Mao, Moloud Abdar, Erik Cambria, Ranjan Satapathy
Title: Debiasing CLIP: Interpreting and Correcting Bias in Attention Heads
Abstract:
Multimodal models like CLIP have gained significant attention due to their remarkable zero-shot performance across various tasks. However, studies have revealed that CLIP can inadvertently learn spurious associations between target variables and confounding factors. To address this, we introduce \textsc{Locate-Then-Correct} (LTC), a contrastive framework that identifies spurious attention heads in Vision Transformers via mechanistic insights and mitigates them through targeted ablation. Furthermore, LTC identifies salient, task-relevant attention heads, enabling the integration of discriminative features through orthogonal projection to improve classification performance. We evaluate LTC on benchmarks with inherent background and gender biases, achieving over a $>50\%$ gain in worst-group accuracy compared to non-training post-hoc baselines. Additionally, we visualize the representation of selected heads and find that the presented interpretation corroborates our contrastive mechanism for identifying both spurious and salient attention heads. Code available at https://github.com/wj210/CLIP_LTC.
Chinese: 提出的“定位后修正”框架通过针对特定注意力头识别并缓解CLIP模型中的伪关联,在存在偏见的基准测试中显著提升了最差组准确率。
English: The proposed \textsc{Locate-Then-Correct} framework identifies and mitigates spurious associations in CLIP models by targeting specific attention heads, significantly improving worst-group accuracy on biased benchmarks.
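
Both interventions reduce to simple linear-algebra operations on head outputs and embeddings. A hedged sketch follows, assuming the spurious head indices and salient task directions have already been located by the contrastive procedure.

```python
# Targeted head ablation and orthogonal projection (illustrative).
import torch

def ablate_heads(head_outputs: torch.Tensor, spurious: list[int]) -> torch.Tensor:
    """head_outputs: (num_heads, dim) per-head contributions to the residual stream."""
    out = head_outputs.clone()
    out[spurious] = 0.0                       # zero out heads flagged as spurious
    return out

def project_salient(embedding: torch.Tensor, directions: torch.Tensor) -> torch.Tensor:
    """directions: (k, dim) task-relevant directions from salient heads;
    returns the orthogonal projection of the embedding onto their span."""
    q, _ = torch.linalg.qr(directions.T)      # orthonormal basis, shape (dim, k)
    return q @ (q.T @ embedding)
```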

Authors:Haoyu Sun, Huichen Will Wang, Jiawei Gu, Linjie Li, Yu Cheng
Title: FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow
Abstract:
Front-end engineering involves a complex workflow where engineers conceptualize designs, translate them into code, and iteratively refine the implementation. While recent benchmarks primarily focus on converting visual designs to code, we present FullFront, a benchmark designed to evaluate Multimodal Large Language Models (MLLMs) \textbf{across the full front-end development pipeline}. FullFront assesses three fundamental tasks that map directly to the front-end engineering pipeline: Webpage Design (conceptualization phase), Webpage Perception QA (comprehension of visual organization and elements), and Webpage Code Generation (implementation phase). Unlike existing benchmarks that use either scraped websites with bloated code or oversimplified LLM-generated HTML, FullFront employs a novel, two-stage process to transform real-world webpages into clean, standardized HTML while maintaining diverse visual designs and avoiding copyright issues. Extensive testing of state-of-the-art MLLMs reveals significant limitations in page perception, code generation (particularly for image handling and layout), and interaction implementation. Our results quantitatively demonstrate performance disparities across models and tasks, and highlight a substantial gap between current MLLM capabilities and human expert performance in front-end engineering. The FullFront benchmark and code are available in https://github.com/Mikivishy/FullFront.
中文摘要:FullFront是一个评估多模态大语言模型在前端开发全流程中性能的综合基准,通过三阶段任务测试发现现有模型在页面感知、代码生成和交互实现方面与人类专家存在显著差距。
English Summary: FullFront is a comprehensive benchmark designed to evaluate Multimodal Large Language Models across the entire front-end development pipeline, revealing significant performance gaps compared to human experts in webpage perception, code generation, and interaction implementation.

Authors:Hitesh Laxmichand Patel, Amit Agarwal, Arion Das, Bhargava Kumar, Srikant Panda, Priyaranjan Pattnayak, Taki Hasan Rafi, Tejaswini Kumar, Dong-Kyu Chae
Title: SweEval: Do LLMs Really Swear? A Safety Benchmark for Testing Limits for Enterprise Use
Abstract:
Enterprise customers are increasingly adopting Large Language Models (LLMs) for critical communication tasks, such as drafting emails, crafting sales pitches, and composing casual messages. Deploying such models across different regions requires them to understand diverse cultural and linguistic contexts and generate safe and respectful responses. For enterprise applications, it is crucial to mitigate reputational risks, maintain trust, and ensure compliance by effectively identifying and handling unsafe or offensive language. To address this, we introduce SweEval, a benchmark simulating real-world scenarios with variations in tone (positive or negative) and context (formal or informal). The prompts explicitly instruct the model to include specific swear words while completing the task. This benchmark evaluates whether LLMs comply with or resist such inappropriate instructions and assesses their alignment with ethical frameworks, cultural nuances, and language comprehension capabilities. In order to advance research in building ethically aligned AI systems for enterprise use and beyond, we release the dataset and code: https://github.com/amitbcp/multilingual_profanity.
中文: 企业采用大语言模型处理通信任务时需确保其理解文化背景并安全回应,为此推出SweEval基准测试模型对不当指令的遵循情况,以评估伦理对齐并降低风险。
English: Enterprises are adopting LLMs for communication tasks but need them to handle cultural contexts safely, so SweEval benchmark tests model compliance with inappropriate instructions to ensure ethical alignment and reduce risks.

Authors:Amit Agarwal, Srikant Panda, Kulbhushan Pachauri
Title: FS-DAG: Few Shot Domain Adapting Graph Networks for Visually Rich Document Understanding
Abstract:
In this work, we propose Few Shot Domain Adapting Graph (FS-DAG), a scalable and efficient model architecture for visually rich document understanding (VRDU) in few-shot settings. FS-DAG leverages domain-specific and language/vision specific backbones within a modular framework to adapt to diverse document types with minimal data. The model is robust to practical challenges such as handling OCR errors, misspellings, and domain shifts, which are critical in real-world deployments. FS-DAG is highly performant with less than 90M parameters, making it well-suited for complex real-world applications for Information Extraction (IE) tasks where computational resources are limited. We demonstrate FS-DAG's capability through extensive experiments for information extraction task, showing significant improvements in convergence speed and performance compared to state-of-the-art methods. Additionally, this work highlights the ongoing progress in developing smaller, more efficient models that do not compromise on performance. Code : https://github.com/oracle-samples/fs-dag
中文: FS-DAG是一种针对少样本场景下视觉丰富文档理解的可扩展高效模型,通过模块化领域专用架构以不足9000万参数处理OCR错误和领域偏移,在信息抽取任务中展现出卓越性能和更快收敛速度。
English: FS-DAG is a scalable and efficient model for visually rich document understanding in few-shot settings, leveraging modular domain-specific backbones to handle OCR errors and domain shifts with under 90M parameters, demonstrating superior performance and faster convergence in information extraction tasks.

Authors:Phat Thanh Dang, Saahil Thoppay, Wang Yang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
Title: SELF: Self-Extend the Context Length With Logistic Growth Function
Abstract:
Large language models suffer issues when operated on long contexts that are larger than their training context length due to the standard position encoding for tokens in the attention layer. Tokens a long distance apart will rarely have an effect on each other, and long prompts yield unexpected results. To solve this problem, we propose SELF (Self-Extend the Context Length With Logistic Growth Function): a solution of grouping consecutive tokens at varying group sizes using a logistic capacity equation combined with a constant group size at smaller relative distances. Our model achieved an increase in performance of up to 12% compared to the LongLM extension method on LEval (specifically on the Qwen model). On summarization-related tasks in LongBench, our model performed up to 6.4% better than LongLM (specifically on the Llama-2-7b model). On reading comprehension tasks from LEval, our model performed up to 5.4% better than LongLM. Our code is available at https://github.com/alexeipc/SELF-LLM.
中文:SELF方法通过逻辑函数对连续标记进行分组来扩展语言模型的有效上下文长度,在多项长文本任务中相比现有方法最高实现了12%的性能提升。
English: The SELF method extends language models' effective context length by grouping tokens with a logistic function, achieving performance improvements of up to 12% over existing methods on various long-context tasks.
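
The grouping rule can be written down directly from a logistic capacity curve: group size 1 within a near window, then growth toward a capacity cap. The constants below are illustrative, not the paper's tuned values.

```python
# Group size as a logistic (capacity-limited) function of relative distance.
import math

def group_size(distance: int, cap: float = 16.0, rate: float = 0.01,
               near_window: int = 512) -> int:
    """Tokens within near_window keep group size 1 (exact relative positions);
    beyond it, consecutive tokens are merged into groups whose size follows
    logistic growth with initial value 1 and carrying capacity `cap`."""
    if distance <= near_window:
        return 1
    x = distance - near_window
    # Logistic solution P(x) = cap / (1 + (cap - 1) * exp(-rate * x)), P(0) = 1.
    return max(1, round(cap / (1 + (cap - 1) * math.exp(-rate * x))))
```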

Authors:Razvan-Gabriel Dumitru, Darius Peteleaza, Vikas Yadav, Liangming Pan
Title: ConciseRL: Conciseness-Guided Reinforcement Learning for Efficient Reasoning Models
Abstract:
Large language models excel at complex tasks by breaking down problems into structured reasoning steps. However, reasoning traces often extend beyond reaching a correct answer, causing wasted computation, reduced readability, and hallucinations. To address this, we introduce a novel hyperparameter-free conciseness score used as a reward signal within a reinforcement learning framework to guide models toward generating correct and concise reasoning traces. This score is evaluated by a large language model acting as a judge, enabling dynamic, context-aware feedback beyond simple token length. Our method achieves state-of-the-art efficiency-accuracy trade-offs on the MATH dataset, reducing token usage by up to 31x on simple problems while improving accuracy by 7%, and on the hardest problems, it outperforms full reasoning by +7.5% accuracy with up to 3.6x fewer tokens. On TheoremQA, our method improves accuracy by +2.2% using 12.5x fewer tokens. We also conduct ablation studies on the judge model, reward composition, and problem difficulty, showing that our method dynamically adapts reasoning length based on problem difficulty and benefits significantly from stronger judges. The code, model weights, and datasets are open-sourced at https://github.com/RazvanDu/ConciseRL.
中文: 本文提出一种无超参数简洁度评分方法,通过强化学习框架优化推理路径的准确性与简洁性,在多个数据集上实现最优效率-准确率平衡,并能根据问题难度动态调整推理长度。
English: This paper introduces a hyperparameter-free conciseness score integrated into a reinforcement learning framework to optimize reasoning traces for both correctness and brevity, achieving superior efficiency-accuracy trade-offs across multiple datasets while dynamically adapting to problem difficulty.
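
The reward design is easy to sketch: correctness gates the reward, and an LLM judge contributes a context-aware conciseness score. The judge prompt and the additive combination below are assumptions for illustration.

```python
# Conciseness-guided reward with an LLM judge (illustrative).
from typing import Callable

def concise_reward(question: str, trace: str, answer_correct: bool,
                   judge: Callable[[str], float], w: float = 0.5) -> float:
    if not answer_correct:
        return 0.0                        # never reward concise-but-wrong traces
    conciseness = judge(                  # judge returns a score in [0, 1]
        f"Question: {question}\nReasoning: {trace}\n"
        "Score 0-1: does the reasoning reach the answer without redundant steps?")
    return (1 - w) + w * conciseness      # every correct trace earns at least 1 - w
```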

Authors:Georgios Chochlakis, Peter Wu, Arjun Bedi, Marcus Ma, Kristina Lerman, Shrikanth Narayanan
Title: Humans Hallucinate Too: Language Models Identify and Correct Subjective Annotation Errors With Label-in-a-Haystack Prompts
Abstract:
Modeling complex subjective tasks in Natural Language Processing, such as recognizing emotion and morality, is considerably challenging due to significant variation in human annotations. This variation often reflects reasonable differences in semantic interpretations rather than mere noise, necessitating methods to distinguish between legitimate subjectivity and error. We address this challenge by exploring label verification in these contexts using Large Language Models (LLMs). First, we propose a simple In-Context Learning binary filtering baseline that estimates the reasonableness of a document-label pair. We then introduce the Label-in-a-Haystack setting: the query and its label(s) are included in the demonstrations shown to LLMs, which are prompted to predict the label(s) again, while receiving task-specific instructions (e.g., emotion recognition) rather than label copying. We show that failures to copy the label(s) to the LLM's output are task-relevant and informative. Building on this, we propose the Label-in-a-Haystack Rectification (LiaHR) framework for subjective label correction: when the model outputs diverge from the reference gold labels, we assign the generated labels to the example instead of discarding it. This approach can be integrated into annotation pipelines to enhance signal-to-noise ratios. Comprehensive analyses, human evaluations, and ecological validity studies verify the utility of LiaHR for label correction. Code is available at https://github.com/gchochla/liahr.
中文摘要:该研究针对主观性自然语言处理任务中的标注差异问题,提出了"大海捞针式标签修正"框架,利用大语言模型识别并修正标签偏差,通过验证后的标签替换而非丢弃不一致数据来提升标注质量。
English Summary: The study addresses annotation variability in subjective NLP tasks by introducing the Label-in-a-Haystack Rectification framework, which uses LLMs to identify and correct label discrepancies, thereby improving annotation quality through validated label replacement instead of discarding inconsistent data.
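
The Label-in-a-Haystack setup and the rectification rule fit in a few lines. In this hypothetical sketch for emotion recognition, the prompt format and the comma-separated label parsing are assumptions.

```python
# Label-in-a-Haystack Rectification loop (illustrative).
from typing import Callable

def liahr(query: str, ref_labels: set[str], demos: list[tuple[str, set[str]]],
          llm: Callable[[str], str]) -> set[str]:
    shots = "\n".join(f"Text: {t}\nEmotions: {', '.join(sorted(l))}" for t, l in demos)
    # The query appears WITH its reference labels among the demonstrations, but
    # the instruction asks for emotion recognition rather than label copying, so
    # failure to reproduce the labels is task-relevant signal.
    prompt = (f"{shots}\nText: {query}\nEmotions: {', '.join(sorted(ref_labels))}\n\n"
              f"Now label the emotions expressed in:\nText: {query}\nEmotions:")
    predicted = {l.strip() for l in llm(prompt).split(",") if l.strip()}
    # Rectify rather than discard: divergent predictions become the new labels.
    return ref_labels if predicted == ref_labels else predicted
```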

Authors:Kangda Wei, Hasnat Md Abdullah, Ruihong Huang
Title: Mitigating Gender Bias via Fostering Exploratory Thinking in LLMs
Abstract:
Large Language Models (LLMs) often exhibit gender bias, resulting in unequal treatment of male and female subjects across different contexts. To address this issue, we propose a novel data generation framework that fosters exploratory thinking in LLMs. Our approach prompts models to generate story pairs featuring male and female protagonists in structurally identical, morally ambiguous scenarios, then elicits and compares their moral judgments. When inconsistencies arise, the model is guided to produce balanced, gender-neutral judgments. These story-judgment pairs are used to fine-tune or optimize the models via Direct Preference Optimization (DPO). Experimental results show that our method significantly reduces gender bias while preserving or even enhancing general model capabilities. We release the code and generated data at: https://github.com/WeiKangda/LLMs-Exploratory-Bias-Mitigation/tree/main.
Chinese: 本研究提出了一种数据生成框架,通过创建平衡的故事对并采用直接偏好优化方法,有效减少大型语言模型中的性别偏见,同时保持模型性能。
English: This study introduces a data generation framework to mitigate gender bias in Large Language Models by creating balanced story pairs and using Direct Preference Optimization, which effectively reduces bias while maintaining model performance.

Authors:Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin
Title: OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning
Abstract:
Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across diverse visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the lack of a systematic benchmark. To address this gap, we propose OCR-Reasoning, a comprehensive benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. The benchmark comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Furthermore, unlike other text-rich image understanding benchmarks that only annotate the final answers, OCR-Reasoning also annotates the reasoning process simultaneously. With the annotated reasoning process and the final answers, OCR-Reasoning evaluates not only the final answers generated by models but also their reasoning processes, enabling a holistic analysis of their problem-solving abilities. Leveraging this benchmark, we conducted a comprehensive evaluation of state-of-the-art MLLMs. Our results demonstrate the limitations of existing methodologies. Notably, even state-of-the-art MLLMs exhibit substantial difficulties, with none achieving accuracy surpassing 50\% across OCR-Reasoning, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.
中文: OCR-Reasoning基准测试旨在系统评估多模态大语言模型在文本丰富图像推理任务中的表现,结果显示即使最先进的模型也面临巨大挑战,准确率均未超过50%,凸显了解决这一问题的紧迫性。
English: The OCR-Reasoning benchmark is introduced to systematically evaluate Multimodal Large Language Models on text-rich image reasoning tasks, revealing that even state-of-the-art models struggle significantly with accuracy below 50%, highlighting an urgent need for improvement in this area.

Authors:Qin Chen, Yuanyi Ren, Xiaojun Ma, Yuyang Shi
Title: Large Language Models for Predictive Analysis: How Far Are They?
Abstract:
Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex decision-making tasks. With the burgeoning expectation to harness LLMs for predictive analysis, there is an urgent need to systematically assess their capability in this domain. However, there is a lack of relevant evaluations in existing studies. To bridge this gap, we introduce the \textbf{PredictiQ} benchmark, which integrates 1130 sophisticated predictive analysis queries originating from 44 real-world datasets of 8 diverse fields. We design an evaluation protocol considering text analysis, code generation, and their alignment. Twelve renowned LLMs are evaluated, offering insights into their practical use in predictive analysis. Generally, we believe that existing LLMs still face considerable challenges in conducting predictive analysis. See \href{https://github.com/Cqkkkkkk/PredictiQ}{Github}.
Chinese: 本文提出了PredictiQ基准,用于系统评估十二种大型语言模型的预测分析能力,发现尽管它们在此领域具有潜力,但仍面临重大挑战。
English: The PredictiQ benchmark is introduced to systematically evaluate the predictive analysis capabilities of twelve large language models, revealing that they still face significant challenges despite their potential in this domain.

Authors:Bohan Jin, Shuhan Qi, Kehai Chen, Xinyi Guo, Xuan Wang
Title: MDIT-Bench: Evaluating the Dual-Implicit Toxicity in Large Multimodal Models
Abstract:
The widespread use of Large Multimodal Models (LMMs) has raised concerns about model toxicity. However, current research mainly focuses on explicit toxicity, with less attention to more implicit toxicity involving prejudice and discrimination. To address this limitation, we introduce a subtler type of toxicity named dual-implicit toxicity and a novel toxicity benchmark termed MDIT-Bench: Multimodal Dual-Implicit Toxicity Benchmark. Specifically, we first create the MDIT-Dataset with dual-implicit toxicity using the proposed Multi-stage Human-in-loop In-context Generation method. Based on this dataset, we construct the MDIT-Bench, a benchmark for evaluating the sensitivity of models to dual-implicit toxicity, with 317,638 questions covering 12 categories, 23 subcategories, and 780 topics. MDIT-Bench includes three difficulty levels, and we propose a metric to measure the toxicity gap exhibited by the model across them. In the experiment, we evaluated 13 prominent LMMs on MDIT-Bench, and the results show that these LMMs cannot handle dual-implicit toxicity effectively. Model performance drops significantly at the hard level, revealing that these LMMs still contain a significant amount of hidden but activatable toxicity. Data are available at https://github.com/nuo1nuo/MDIT-Bench.
中文: 本研究提出了双重隐性毒性和MDIT-Bench基准,发现当前大型多模态模型难以有效识别微妙的偏见与歧视问题,且在更高难度级别表现显著下降。
English: This study introduces dual-implicit toxicity and MDIT-Bench, a benchmark revealing that current large multimodal models struggle with detecting subtle prejudice and discrimination, especially at higher difficulty levels.

Authors:Subrata Biswas, Mohammad Nur Hossain Khan, Bashima Islam
Title: RAVEN: Query-Guided Representation Alignment for Question Answering over Audio, Video, Embedded Sensors, and Natural Language
Abstract:
Multimodal question answering (QA) often requires identifying which video, audio, or sensor tokens are relevant to the question. Yet modality disagreements are common: off-camera speech, background noise, or motion outside the field of view often mislead fusion models that weight all streams equally. We present RAVEN, a unified QA architecture whose core is QuART, a query-conditioned cross-modal gating module that assigns scalar relevance scores to each token across modalities, enabling the model to amplify informative signals and suppress distractors before fusion. RAVEN is trained through a three-stage pipeline comprising unimodal pretraining, query-aligned fusion, and disagreement-oriented fine-tuning -- each stage targeting a distinct challenge in multi-modal reasoning: representation quality, cross-modal relevance, and robustness to modality mismatch. To support training and evaluation, we release AVS-QA, a dataset of 300K synchronized Audio--Video-Sensor streams paired with automatically generated question-answer pairs. Experimental results on seven multi-modal QA benchmarks -- including egocentric and exocentric tasks -- show that RAVEN achieves up to 14.5\% and 8.0\% gains in accuracy compared to state-of-the-art multi-modal large language models, respectively. Incorporating sensor data provides an additional 16.4\% boost, and the model remains robust under modality corruption, outperforming SOTA baselines by 50.23\%. Our code and dataset are available at https://github.com/BASHLab/RAVEN.
中文: RAVEN通过其核心组件QuART实现了跨模态查询条件门控机制,能动态评估各模态标记的相关性以增强有效信号并抑制干扰,在多模态问答基准测试中显著提升准确率,并在模态受损情况下保持优异鲁棒性。
English: RAVEN introduces QuART, a query-conditioned cross-modal gating module that dynamically weights tokens across modalities to enhance relevant signals and suppress distractors, achieving significant accuracy improvements on multimodal QA benchmarks while maintaining robustness against modality corruption.
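
The gating module can be sketched as a small scorer over (query, token) pairs whose sigmoid output scales each token before fusion. The dimensions and two-layer scorer below are placeholders, not the released architecture.

```python
# QuART-style query-conditioned cross-modal token gating (illustrative).
import torch
import torch.nn as nn

class QueryGate(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, query: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """query: (batch, dim); tokens: (batch, n, dim) concatenated across the
        audio/video/sensor streams. Returns gated tokens of the same shape."""
        q = query.unsqueeze(1).expand(-1, tokens.size(1), -1)
        gates = torch.sigmoid(self.score(torch.cat([q, tokens], dim=-1)))  # (batch, n, 1)
        return tokens * gates   # amplify informative tokens, suppress distractors
```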

Authors:Xiaozhao Liu, Dinggang Shen, Xihui Liu
Title: Learning Interpretable Representations Leads to Semantically Faithful EEG-to-Text Generation
Abstract:
Pretrained generative models have opened new frontiers in brain decoding by enabling the synthesis of realistic texts and images from non-invasive brain recordings. However, the reliability of such outputs remains questionable--whether they truly reflect semantic activation in the brain, or are merely hallucinated by the powerful generative models. In this paper, we focus on EEG-to-text decoding and address its hallucination issue through the lens of posterior collapse. Acknowledging the underlying mismatch in information capacity between EEG and text, we reframe the decoding task as semantic summarization of core meanings rather than previously verbatim reconstruction of stimulus texts. To this end, we propose the Generative Language Inspection Model (GLIM), which emphasizes learning informative and interpretable EEG representations to improve semantic grounding under heterogeneous and small-scale data conditions. Experiments on the public ZuCo dataset demonstrate that GLIM consistently generates fluent, EEG-grounded sentences without teacher forcing. Moreover, it supports more robust evaluation beyond text similarity, through EEG-text retrieval and zero-shot semantic classification across sentiment categories, relation types, and corpus topics. Together, our architecture and evaluation protocols lay the foundation for reliable and scalable benchmarking in generative brain decoding.
中文: 预训练生成模型通过从脑电图信号合成文本推进了脑解码,但存在可靠性问题,本研究通过语义概括方法和提出的GLIM模型来解决,以增强语义基础和评估。
English: Pretrained generative models advance brain decoding by synthesizing text from EEG signals, but face reliability issues, which this study addresses through a semantic summarization approach and the proposed GLIM model to enhance grounding and evaluation.

Authors:Kaibo Huang, Zipei Zhang, Yukun Wei, TianXin Zhang, Zhongliang Yang, Linna Zhou
Title: GSDFuse: Capturing Cognitive Inconsistencies from Multi-Dimensional Weak Signals in Social Media Steganalysis
Abstract:
The ubiquity of social media platforms facilitates malicious linguistic steganography, posing significant security risks. Steganalysis is profoundly hindered by the challenge of identifying subtle cognitive inconsistencies arising from textual fragmentation and complex dialogue structures, and the difficulty in achieving robust aggregation of multi-dimensional weak signals, especially given extreme steganographic sparsity and sophisticated steganography. These core detection difficulties are compounded by significant data imbalance. This paper introduces GSDFuse, a novel method designed to systematically overcome these obstacles. GSDFuse employs a holistic approach, synergistically integrating hierarchical multi-modal feature engineering to capture diverse signals, strategic data augmentation to address sparsity, adaptive evidence fusion to intelligently aggregate weak signals, and discriminative embedding learning to enhance sensitivity to subtle inconsistencies. Experiments on social media datasets demonstrate GSDFuse's state-of-the-art (SOTA) performance in identifying sophisticated steganography within complex dialogue environments. The source code for GSDFuse is available at https://github.com/NebulaEmmaZh/GSDFuse.
中文摘要:本文提出GSDFuse新方法,通过整合分层特征工程、数据增强、自适应融合和判别学习,系统解决了社交媒体恶意语言隐写检测中的核心难题,在复杂对话环境中实现了最先进的检测性能。
English Summary: This paper introduces GSDFuse, a novel method that overcomes key challenges in detecting malicious linguistic steganography on social media by integrating hierarchical feature engineering, data augmentation, adaptive fusion, and discriminative learning, achieving state-of-the-art performance.

Authors:Yiduo Guo, Zhen Guo, Chuanwei Huang, Zi-Ang Wang, Zekai Zhang, Haofei Yu, Huishuai Zhang, Yikang Shen
Title: Synthetic Data RL: Task Definition Is All You Need
Abstract:
Reinforcement learning (RL) is a powerful way to adapt foundation models to specialized tasks, but its reliance on large-scale human-labeled data limits broad adoption. We introduce Synthetic Data RL, a simple and general framework that reinforcement fine-tunes models using only synthetic data generated from a task definition. Our method first generates question and answer pairs from the task definition and retrieved documents, then adapts the difficulty of the question based on model solvability, and selects questions using the average pass rate of the model across samples for RL training. On Qwen-2.5-7B, our method achieves a 29.2% absolute improvement over the base model on GSM8K (+2.9 pp vs. instruction-tuned, +6.6 pp vs. Self-Instruct), 8.7% on MATH, 13.1% on GPQA (+7.0 pp vs. SynthLLM), 8.9% on MedQA, 17.7% on CQA (law) and 13.7% on CFA (finance). It surpasses supervised fine-tuning under the same data budget and nearly matches RL with full human data across datasets (e.g., +17.2 pp on GSM8K). Adding 100 human demonstrations improves the performance of GSM8K only by 0.4 pp, showing a limited added value. By reducing human data annotation, Synthetic Data RL enables scalable and efficient RL-based model adaptation. Code and demos are available at https://github.com/gydpku/Data_Synthesis_RL/.
中文: 合成数据强化学习提出了一种仅通过任务定义生成合成数据来微调模型的框架,在多个基准测试中取得显著性能提升,同时减少了对人工标注数据的依赖。
English: Synthetic Data RL introduces a framework that fine-tunes models using only synthetic data generated from task definitions, achieving significant performance improvements across various benchmarks while reducing reliance on human-labeled data.
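
The pass-rate-based selection step can be sketched directly; the sampling budget and the (0.2, 0.8) difficulty band below are assumptions standing in for the paper's solvability-based adaptation.

```python
# Select synthetic questions by the model's average pass rate (illustrative).
from typing import Callable

def select_for_rl(questions: list[dict], model_solve: Callable[[str], bool],
                  samples: int = 8, band: tuple[float, float] = (0.2, 0.8)) -> list[dict]:
    """Keep items whose pass rate over sampled rollouts is mid-range, so RL
    training sees questions that are neither trivial nor hopeless."""
    kept = []
    for item in questions:
        passes = sum(model_solve(item["question"]) for _ in range(samples))
        rate = passes / samples
        if band[0] <= rate <= band[1]:
            kept.append({**item, "pass_rate": rate})
    return kept
```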

Authors:Xinlong Chen, Yuanxing Zhang, Qiang Liu, Junfei Wu, Fuzheng Zhang, Tieniu Tan
Title: Mixture of Decoding: An Attention-Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision-Language Models
Abstract:
Large Vision-Language Models (LVLMs) have exhibited impressive capabilities across various visual tasks, yet they remain hindered by the persistent challenge of hallucinations. To address this critical issue, we propose Mixture of Decoding (MoD), a novel approach for hallucination mitigation that dynamically adapts decoding strategies by evaluating the correctness of the model's attention on image tokens. Specifically, MoD measures the consistency between outputs generated from the original image tokens and those derived from the model's attended image tokens, to distinguish the correctness aforementioned. If the outputs are consistent, indicating correct attention, MoD employs a complementary strategy to amplify critical information. Conversely, if the outputs are inconsistent, suggesting erroneous attention, MoD utilizes a contrastive strategy to suppress misleading information. Extensive experiments demonstrate that MoD significantly outperforms existing decoding methods across multiple mainstream benchmarks, effectively mitigating hallucinations in LVLMs. The code is available at https://github.com/xlchen0205/MoD.
中文: 提出的混合解码方法通过评估原始图像标记与模型关注标记输出的一致性,动态调整解码策略,有效缓解大型视觉语言模型的幻觉问题,并在多个基准测试中显著优于现有方法。
English: The proposed Mixture of Decoding (MoD) approach dynamically adjusts decoding strategies by assessing the consistency between outputs from original and attended image tokens, effectively mitigating hallucinations in Large Vision-Language Models and outperforming existing methods across multiple benchmarks.
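
At the logit level, the two branches resemble additive versus contrastive mixing. A hedged sketch, assuming the consistency check has already compared outputs generated from the full and attended image tokens; `alpha` and the exact mixing rules are illustrative.

```python
# Consistency-conditioned mixture of decoding strategies (illustrative).
import torch

def mod_logits(logits_full: torch.Tensor, logits_attended: torch.Tensor,
               consistent: bool, alpha: float = 0.5) -> torch.Tensor:
    if consistent:
        # Correct attention: complementary strategy amplifies the shared signal.
        return logits_full + alpha * logits_attended
    # Erroneous attention: contrastive strategy suppresses misleading signal.
    return (1 + alpha) * logits_full - alpha * logits_attended
```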

Authors:Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
Title: SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation
Abstract:
In order to enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech rather than text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least 30\% relative performance improvement over existing open-source full-duplex models and performs highly competitively to half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation and context-dependent barge-in, with further improvements achieved through reinforcement learning. Some demo conversations between user and SALMONN-omni are provided in the following repository https://github.com/bytedance/SALMONN.
Chinese: SALMONN-omni首次提出了无需音频编解码器的独立全双工语音大模型,其动态思维机制能自主切换听说状态,在语音问答基准测试中性能提升超30%,并在复杂对话场景中表现卓越。
English: SALMONN-omni introduces the first standalone full-duplex speech LLM that eliminates audio codecs and incorporates a dynamic thinking mechanism to seamlessly switch between speaking and listening, achieving over 30% performance improvement in benchmarks while excelling in complex conversational scenarios.

Authors:Chengqi Duan, Rongyao Fang, Yuqing Wang, Kun Wang, Linjiang Huang, Xingyu Zeng, Hongsheng Li, Xihui Liu
Title: GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning
Abstract:
Visual generation models have made remarkable progress in creating realistic images from text prompts, yet struggle with complex prompts that specify multiple objects with precise spatial relationships and attributes. Effective handling of such prompts requires explicit reasoning about the semantic content and spatial layout. We present GoT-R1, a framework that applies reinforcement learning to enhance semantic-spatial reasoning in visual generation. Building upon the Generation Chain-of-Thought approach, GoT-R1 enables models to autonomously discover effective reasoning strategies beyond predefined templates through carefully designed reinforcement learning. To achieve this, we propose a dual-stage multi-dimensional reward framework that leverages MLLMs to evaluate both the reasoning process and final output, enabling effective supervision across the entire generation pipeline. The reward system assesses semantic alignment, spatial accuracy, and visual quality in a unified approach. Experimental results demonstrate significant improvements on T2I-CompBench benchmark, particularly in compositional tasks involving precise spatial relationships and attribute binding. GoT-R1 advances the state-of-the-art in image generation by successfully transferring sophisticated reasoning capabilities to the visual generation domain. To facilitate future research, we make our code and pretrained models publicly available at https://github.com/gogoduan/GoT-R1.
中文摘要:GoT-R1是一个强化学习框架,通过让模型自主开发复杂文本提示的推理策略,并采用统一奖励机制评估语义对齐与空间精度,显著提升了多对象空间关系和属性绑定的图像生成能力。
English Summary: GoT-R1 is a reinforcement learning framework that enhances visual generation by enabling models to autonomously develop reasoning strategies for complex text prompts, achieving superior performance in spatial relationships and attribute binding through a unified reward system.

Authors:Chengzhuo Tong, Ziyu Guo, Renrui Zhang, Wenyu Shan, Xinyu Wei, Zhenghao Xing, Hongsheng Li, Pheng-Ann Heng
Title: Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO
Abstract:
Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT
中文: 本研究首次对自回归图像生成中的GRPO和DPO强化学习算法进行全面分析,揭示了它们各自的优势,并证明具有更强泛化能力的奖励模型能够同时提升域内性能和跨域泛化能力。
English: This study provides the first comprehensive analysis of GRPO and DPO reinforcement learning algorithms in autoregressive image generation, revealing their distinct advantages and demonstrating how reward models with stronger generalization capabilities can enhance both in-domain performance and out-of-domain generalization.

Authors:Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Title: R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning
Abstract:
Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the model's internal knowledge. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome-supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at https://github.com/RUCAIBox/R1-Searcher-plus.
中文: 本文提出R1-Searcher++框架,通过两阶段训练策略使大语言模型能够自适应地利用内部和外部知识,相比现有方法实现了更优的性能和高效的检索能力。
English: This paper introduces R1-Searcher++, a framework that trains LLMs to adaptively use both internal and external knowledge through a two-stage training strategy, achieving superior performance and efficient retrieval compared to previous methods.

Authors:Jin Jiang, Jianing Wang, Yuchen Yan, Yang Liu, Jianhua Zhu, Mengdi Zhang, Xunliang Cai, Liangcai Gao
Title: Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?
Abstract:
Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we also curate the formal-relative training data to further enhance the small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our codes and reports are available at https://github.com/jiangjin1999/FormalEval.
Chinese: 本研究利用形式语言全面评估大语言模型的逻辑推理能力,发现思维模型优于指令模型,所有模型在归纳推理方面存在不足,而PoT格式数据泛化性能最佳,且拒绝式微调可进一步提升模型表现。
English: This study comprehensively evaluates large language models' logical reasoning capabilities using formal languages, finding that thinking models outperform instruct models, all models struggle with inductive reasoning, and PoT-formatted data yields the best generalization, with rejected fine-tuning further enhancing performance.

Authors:Runyang You, Yongqi Li, Xinyu Lin, Xin Zhang, Wenjie Wang, Wenjie Li, Liqiang Nie
Title: $\text{R}^2\text{ec}$: Towards Large Recommender Models with Reasoning
Abstract:
Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs are limited by significant resource costs and suboptimal joint optimization. To address these issues, we propose R$^2$ec, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of R$^2$ec simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of R$^2$ec, showing relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at https://github.com/YRYangang/RRec.
Chinese: 作者提出了一种具有内在推理能力的统一大型推荐模型,通过自回归架构和强化学习框架将推理与推荐相结合,无需专门推理标注即可同时优化两者,实现了显著的性能提升。
English: The authors propose a unified large recommender model with intrinsic reasoning capabilities, integrating reasoning and recommendation through an autoregressive architecture and a reinforcement learning framework that optimizes both without requiring specialized reasoning annotations, achieving significant performance improvements.

Authors:Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen
Title: LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding
Abstract:
Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at https://github.com/EIT-NLP/StreamingLLM.
Chinese: 本研究通过识别关键不匹配问题并引入无需修改架构的分组位置编码方法,有效提升了批处理大语言模型在流式应用中的性能,并在多任务实验中验证了其优越性。
English: This study addresses the inefficiencies in adapting batch-oriented Large Language Models for streaming by identifying key mismatches and introducing a group position encoding method that enhances performance without architectural changes, validated across diverse tasks.
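
One plausible reading of the group position encoding idea, sketched below: tokens keep contiguous position IDs within their own stream (source vs. target) even when segments arrive interleaved, so relative positions inside each context are preserved. The segment format and offset scheme are assumptions for illustration, not the paper's exact design:

```python
# Assign position IDs per group ("src" or "tgt") rather than by arrival order.
def group_position_ids(segments: list[tuple[str, int]]) -> list[int]:
    counters = {"src": 0, "tgt": 0}
    # Offset the target group so its IDs never collide with source IDs.
    total_src = sum(n for group, n in segments if group == "src")
    ids = []
    for group, n in segments:
        base = 0 if group == "src" else total_src
        for _ in range(n):
            ids.append(base + counters[group])
            counters[group] += 1
    return ids

# src(3) tgt(2) src(2) tgt(1): source tokens get 0..4, target tokens 5..7,
# matching the batch-mode layout even though the streams are interleaved.
print(group_position_ids([("src", 3), ("tgt", 2), ("src", 2), ("tgt", 1)]))
# [0, 1, 2, 5, 6, 3, 4, 7]
```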

Authors:Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, Siheng Chen
Title: SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development
Abstract:
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g. code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at https://github.com/DorothyDUUU/SWE-Dev.
中文: SWE-Dev数据集作为首个针对现实世界功能开发任务的大规模基准,不仅揭示了当前AI模型在此领域的重大挑战,还通过高质量训练数据显著提升了模型性能,使7B参数模型在困难任务上达到与GPT-4o相当的水平。
English: The SWE-Dev dataset is introduced as the first large-scale benchmark for evaluating and training autonomous coding systems on real-world feature development tasks, demonstrating both the challenge of this domain for current AI models and the dataset's effectiveness in enabling significant model improvements through fine-tuning.

Authors:Nandan Thakur, Crystina Zhang, Xueguang Ma, Jimmy Lin
Title: Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval
Abstract:
Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35$\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.
中文: 通过剔除低质量数据集并采用级联大语言模型提示重标注假阴性样本,显著提升了检索和重排序模型在BEIR与AIR-Bench评估中的性能表现。
English: Pruning low-quality datasets and using cascading LLM prompts to relabel false negatives significantly improves retrieval and reranker model performance on BEIR and AIR-Bench evaluations.
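
A minimal sketch of the cascading relabeling idea, assuming a hypothetical `ask_llm(model, query, passage)` helper that returns "relevant" or "irrelevant"; the two-stage escalation reflects the paper's finding that GPT-4o agrees with human annotators far better than GPT-4o-mini:

```python
def relabel_hard_negatives(query, hard_negatives, ask_llm):
    """Cascade: a cheap judge screens every pair; only flagged pairs are
    escalated to a stronger (more expensive) judge for the final label."""
    relabeled = []
    for passage in hard_negatives:
        # Stage 1: a cheap judge filters out clearly irrelevant passages.
        if ask_llm("gpt-4o-mini", query, passage) == "irrelevant":
            relabeled.append((passage, "negative"))  # kept as a true hard negative
            continue
        # Stage 2: potential false negative -- let the stronger judge decide
        # whether to relabel it as a true positive.
        verdict = ask_llm("gpt-4o", query, passage)
        relabeled.append((passage, "positive" if verdict == "relevant" else "negative"))
    return relabeled
```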

Authors:InternAgent Team, Bo Zhang, Shiyang Feng, Xiangchao Yan, Jiakang Yuan, Runmin Ma, Yusong Hu, Zhiyin Yu, Xiaohan He, Songtao Huang, Shaowei Hou, Zheng Nie, Zhilong Wang, Jinyao Liu, Tianshuo Peng, Peng Ye, Dongzhan Zhou, Shufei Zhang, Xiaosong Wang, Yilan Zhang, Meng Li, Zhongying Tu, Xiangyu Yue, Wangli Ouyang, Bowen Zhou, Lei Bai
Title: InternAgent: When Agent Becomes the Scientist -- Building Closed-Loop System from Hypothesis to Verification
Abstract:
Artificial Intelligence (AI) is accelerating the transformation of scientific research paradigms, not only enhancing research efficiency but also driving innovation. We introduce InternAgent, a unified closed-loop multi-agent framework to conduct Autonomous Scientific Research (ASR) across various scientific research fields, enabling researchers to tackle complicated problems in these fields with unprecedented speed and precision. InternAgent highlights three key advantages: 1) Scalability: InternAgent has demonstrated its versatility across 12 scientific research tasks, capable of generating innovative ideas to enhance the performance of baseline code. 2) Interactivity: InternAgent provides an interface for human expert feedback and multi-agent interaction in automated end-to-end processes, allowing for the seamless integration of domain expert knowledge. 3) Efficiency: InternAgent has achieved promising performance gains in several scientific fields with significantly less time cost compared to human efforts. For instance, in reaction yield prediction, it increased from 27.6% to 35.4% in just 12 hours; in enhancer activity prediction, accuracy rose from 0.65 to 0.79 with only 4 hours of processing; and in 2D semantic segmentation, precision advanced from 78.8% to 81.0% in a mere 30 hours.
中文: InternAgent作为一种统一的多智能体框架,通过其可扩展性、交互性和高效性,在多个科学领域推动自主研究,实现快速创新并与领域专家无缝协作。
English: InternAgent is a unified multi-agent framework that accelerates autonomous scientific research across various fields, offering scalability, interactivity, and efficiency by enabling rapid innovation and seamless human-expert collaboration.

Authors:Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li
Title: LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
Abstract:
In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant in current multimodal approaches. Built upon LLaDA, a representative large language diffusion model, LLaDA-V incorporates a vision encoder and MLP connector that projects visual features into the language embedding space, enabling effective multimodal alignment. Our empirical investigation reveals several intriguing results: First, LLaDA-V demonstrates promising multimodal performance despite its language model being weaker on purely textual tasks than counterparts like LLaMA3-8B and Qwen2-7B. When trained on the same instruction data, LLaDA-V is highly competitive to LLaMA3-V across multimodal tasks with better data scalability. It also narrows the performance gap to Qwen2-VL, suggesting the effectiveness of its architecture for multimodal tasks. Second, LLaDA-V achieves state-of-the-art performance in multimodal understanding compared to existing hybrid autoregressive-diffusion and purely diffusion-based MLLMs. Our findings suggest that large language diffusion models show promise in multimodal contexts and warrant further investigation in future research. Project page and codes: https://ml-gsai.github.io/LLaDA-V-demo/.
中文: LLaDA-V是一种基于扩散的多模态模型,通过将视觉指令调整与掩码扩散模型相结合,在文本任务能力较弱的情况下仍展现出竞争力的多模态性能,并在同类扩散模型中实现了最先进的效果。
English: LLaDA-V is a diffusion-based multimodal model that integrates visual instruction tuning with masked diffusion models, demonstrating competitive performance in multimodal tasks despite weaker text-only capabilities and achieving state-of-the-art results among diffusion-based MLLMs.

Authors:Daniel F. Perez-Ramirez, Dejan Kostic, Magnus Boman
Title: CASTILLO: Characterizing Response Length Distributions of Large Language Models
Abstract:
Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each $\langle$prompt, model$\rangle$ sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.
中文摘要:CASTILLO数据集系统刻画了13种开源大语言模型的响应长度分布,揭示了显著的长度变异性,为开发预测模型以实现推理资源的前瞻性调度提供了基础。
English Summary: CASTILLO is a dataset that characterizes response length distributions across 13 open-source LLMs, revealing significant variability and enabling predictive models for proactive resource allocation in LLM inference.
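
A minimal sketch of the per-pair summary statistics the dataset publishes; the ten token lengths below are made up for illustration:

```python
import numpy as np

def summarize_lengths(token_lengths: list[int]) -> dict:
    """Summary statistics for the 10 completions of one <prompt, model> pair."""
    a = np.asarray(token_lengths)
    return {
        "mean": float(a.mean()),
        "std": float(a.std(ddof=1)),
        "p25": float(np.percentile(a, 25)),
        "p50": float(np.percentile(a, 50)),
        "p75": float(np.percentile(a, 75)),
        "min": int(a.min()),   # shortest completion length
        "max": int(a.max()),   # longest completion length
    }

# Note the heavy right tail (e.g. 1024): exactly the kind of intra-model
# variability under identical decoding settings that the paper reports.
print(summarize_lengths([212, 198, 240, 530, 201, 225, 219, 1024, 230, 208]))
```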

Authors:Yibo Wang, Li Shen, Huanjin Yao, Tiansheng Huang, Rui Liu, Naiqiang Tan, Jiaxing Huang, Kai Zhang, Dacheng Tao
Title: R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search
Abstract:
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select the short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at https://github.com/w-yibo/R1-Compress
Chinese: R1-Compress是一种两阶段分块压缩框架,通过在MATH500等测试中实现92.4%的准确率并减少20%的token使用量,有效解决了长链思维推理中的计算效率问题。
English: R1-Compress is a two-stage chunk-level compression framework that significantly reduces token usage in Long-CoT reasoning while maintaining high reasoning accuracy, as demonstrated by achieving 92.4% accuracy with 20% fewer tokens on MATH500.
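
A minimal sketch of the two-stage chunk-level scheme, assuming two hypothetical helpers: `compress(text)` returns candidate shorter rewrites of one chunk, and `coherent(prefix, candidate)` checks whether a candidate reads smoothly after the already-selected text:

```python
def r1_compress(cot_steps: list[str], chunk_size: int, compress, coherent) -> str:
    """Segment a Long-CoT into chunks, compress each chunk, and search for a
    short candidate that stays coherent with what has been kept so far."""
    chunks = [cot_steps[i:i + chunk_size] for i in range(0, len(cot_steps), chunk_size)]
    kept: list[str] = []
    for chunk in chunks:
        original = " ".join(chunk)
        # Stage 1: inner-chunk compression proposes several shorter versions.
        candidates = compress(original)
        # Stage 2: inter-chunk search keeps the shortest coherent candidate,
        # falling back to the original chunk if none qualifies.
        prefix = " ".join(kept)
        valid = [c for c in candidates if coherent(prefix, c)]
        kept.append(min(valid, key=len) if valid else original)
    return " ".join(kept)
```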

Authors:Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
Title: SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis
Abstract:
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they either lack high-quality training trajectories or suffer from distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.
中文: SimpleDeepSearcher 是一个轻量级框架,通过模拟真实用户交互和多标准筛选策略合成高质量训练数据,仅需少量样本即可超越基于强化学习的基线方法,有效解决了检索增强生成系统的数据稀缺瓶颈。
English: SimpleDeepSearcher is a lightweight framework that overcomes limitations in retrieval-augmented generation systems by synthesizing high-quality training data through simulated user interactions and multi-criteria curation, achieving superior performance with minimal samples compared to reinforcement learning approaches.

Authors:Haonian Ji, Shi Qiu, Siyang Xin, Siwei Han, Zhaorun Chen, Dake Zhang, Hongyi Wang, Huaxiu Yao
Title: From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Reasoning-Driven Pedagogical Visualization
Abstract:
While foundation models (FMs), such as diffusion models and large vision-language models (LVLMs), have been widely applied in educational contexts, their ability to generate pedagogically effective visual explanations remains limited. Most existing approaches focus primarily on textual reasoning, overlooking the critical role of structured and interpretable visualizations in supporting conceptual understanding. To better assess the visual reasoning capabilities of FMs in educational settings, we introduce EduVisBench, a multi-domain, multi-level benchmark. EduVisBench features diverse STEM problem sets requiring visually grounded solutions, along with a fine-grained evaluation rubric informed by pedagogical theory. Our empirical analysis reveals that existing models frequently struggle with the inherent challenge of decomposing complex reasoning and translating it into visual representations aligned with human cognitive processes. To address these limitations, we propose EduVisAgent, a multi-agent collaborative framework that coordinates specialized agents for instructional planning, reasoning decomposition, metacognitive prompting, and visualization design. Experimental results show that EduVisAgent substantially outperforms all baselines, achieving a 40.2% improvement and delivering more educationally aligned visualizations. EduVisBench and EduVisAgent are available at https://github.com/aiming-lab/EduVisBench and https://github.com/aiming-lab/EduVisAgent.
Chinese: 本文提出了用于评估教育领域基础模型视觉推理能力的基准EduVisBench,并开发了多智能体框架EduVisAgent,该框架通过协同合作显著提升了符合教学需求的可视化内容生成效果。
English: This paper introduces EduVisBench, a benchmark for evaluating the visual reasoning capabilities of foundation models in education, and proposes EduVisAgent, a multi-agent framework that significantly enhances the generation of pedagogically effective visualizations.

Authors:Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
Title: Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs
Abstract:
Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.
中文: 当前大语言模型遗忘效果评估依赖的词汇级指标存在误导性,因为模型可能只是表面遗忘而信息仍可恢复,这凸显了需要建立表征分析框架来区分可逆与不可逆遗忘的必要性。
English: Current token-level metrics for evaluating unlearning in LLMs can be misleading, as models may only superficially forget information that remains recoverable, prompting the need for a new representational analysis framework to distinguish between reversible and irreversible forgetting.
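
The representation-level toolkit combines several standard similarity measures; below is a minimal numpy sketch of one of them, linear centered kernel alignment (CKA), assuming `X` and `Y` hold hidden states for the same inputs before and after unlearning:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two (n_samples x n_features) representation matrices."""
    X = X - X.mean(axis=0)                      # center features
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2  # cross-covariance alignment
    norm = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(hsic / norm)

rng = np.random.default_rng(0)
base = rng.normal(size=(500, 8))
# CKA is invariant to rotations of the feature space, so a rotated copy
# scores ~1.0 (latent features retained), while fresh noise scores near 0.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
print(linear_cka(base, base @ rotation))            # ~1.0: geometry preserved
print(linear_cka(base, rng.normal(size=(500, 8))))  # near 0: unrelated features
```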

Authors:Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin
Title: Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Abstract:
As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_vulnerability.
中文摘要:针对特定领域数据微调大型语言模型可能意外引入脆弱性,这些脆弱性受数据集的语言特征和毒性等因素影响,从而削弱模型鲁棒性,并凸显了策略性数据集设计对防御的重要性。
English Summary: Fine-tuning large language models on domain-specific data can inadvertently introduce accidental vulnerabilities, which are influenced by factors like linguistic features and toxicity in the datasets, ultimately affecting model robustness and highlighting the importance of strategic dataset design for defense.

Authors:Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, Xiaoyu Shen
Title: Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning
Abstract:
Large Language Models (LLMs) have achieved impressive performance on complex reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional CoT relies on reasoning steps explicitly verbalized in natural language, introducing inefficiencies and limiting its applicability to abstract reasoning. To address this, there has been growing research interest in latent CoT reasoning, where inference occurs within latent spaces. By decoupling reasoning from language, latent reasoning promises richer cognitive representations and more flexible, faster inference. Researchers have explored various directions in this promising field, including training methodologies, structural innovations, and internal reasoning mechanisms. This paper presents a comprehensive overview and analysis of this reasoning paradigm. We begin by proposing a unified taxonomy from four perspectives: token-wise strategies, internal mechanisms, analysis, and applications. We then provide in-depth discussions and comparative analyses of representative methods, highlighting their design patterns, strengths, and open challenges. We aim to provide a structured foundation for advancing this emerging direction in LLM reasoning. The relevant papers will be regularly updated at https://github.com/EIT-NLP/Awesome-Latent-CoT.
中文: 本文对大型语言模型中的潜在思维链推理进行了全面综述,提出了统一分类法并分析各类方法,旨在推动这种解耦式推理范式的发展,以实现更高效灵活的推理能力。
English: This paper provides a comprehensive overview of latent Chain-of-Thought reasoning in Large Language Models, proposing a unified taxonomy and analyzing methods to advance this decoupled reasoning approach for more efficient and flexible inference.

Authors:Yiming Gao, Bin Wang, Chengwei Wei, Shuo Sun, AiTi Aw
Title: IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models
Abstract:
Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.
中文: 大语言模型在多模态环境下的指令遵循能力常会减弱,为此我们开发了IFEval-Audio数据集,包含280组音频-指令-答案三元组,用于从六个维度评估音频大模型的指令执行能力。
English: Large language models' instruction-following ability often weakens in multimodal settings, prompting the creation of IFEval-Audio, a dataset with 280 audio-instruction-answer triples to evaluate audio-based LLMs across six dimensions.

Authors:Florentin Beck, William Rudman, Carsten Eickhoff
Title: TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning
Abstract:
Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM
中文:TRIM提出了一种针对性的迭代剪枝方法,通过为各层内部维度分配差异化稀疏度,在多类大语言模型压缩中实现了最优性能与稳定性。
English: TRIM introduces a targeted, iterative pruning method that applies varying sparsity to individual dimensions within layers, achieving state-of-the-art performance and stability in LLM compression across multiple models.
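
A minimal numpy sketch of row-wise magnitude pruning with per-row sparsity ratios; note that TRIM's actual contribution is the iterative, metric-driven choice of those ratios, which are simply given as inputs here:

```python
import numpy as np

def prune_rows(W: np.ndarray, row_sparsities: np.ndarray) -> np.ndarray:
    """Zero out the smallest-magnitude weights in each output row (dimension),
    with a potentially different sparsity ratio per row."""
    W = W.copy()
    for i, s in enumerate(row_sparsities):
        k = int(s * W.shape[1])           # number of weights to drop in row i
        if k:
            drop = np.argsort(np.abs(W[i]))[:k]
            W[i, drop] = 0.0
    return W

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 10))
# Rows whose outputs tolerate more damage get higher ratios; the mean still
# hits the layer's overall 80% sparsity target.
sparse_W = prune_rows(W, np.array([0.7, 0.9, 0.8, 0.8]))
print((sparse_W == 0).mean(axis=1))  # per-row achieved sparsity
```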

Authors:Chengcan Wu, Zhixin Zhang, Zeming Wei, Yihao Zhang, Meng Sun
Title: Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization
Abstract:
The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model's risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at https://github.com/ChengcanWu/SAP.
Chinese: 本文提出了一种安全感知探测(SAP)框架,通过在梯度传播过程中引入安全感知探针,有效减轻大型语言模型在微调过程中的安全性退化,在保持性能的同时显著降低有害性。
English: The paper introduces a safety-aware probing (SAP) framework that mitigates safety degradation in large language models during fine-tuning by incorporating safety-aware probes into gradient propagation, effectively reducing harmfulness while maintaining performance.

Authors:Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
Title: Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification
Abstract:
As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 392 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.
中文摘要:本研究提出了跨语言去毒方法,通过大量实验验证了其在不同语言间降低大型语言模型毒性的有效性,同时揭示了安全性与知识保留之间的权衡关系。
English Summary: This study introduces cross-lingual detoxification to reduce toxicity in large language models across different languages, demonstrating its effectiveness through extensive testing while highlighting the trade-off between safety and knowledge preservation.

Authors:Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
Title: R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Abstract:
In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, and then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation, which estimates solution advantages hierarchically across and within question variants, allowing more accurate estimation of relative advantages and improving the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
中文摘要:本研究提出Share-GRPO方法,通过扩展问题空间并共享多样化推理路径和奖励信息,有效解决强化学习中的稀疏奖励和优势消失问题,从而提升多模态大语言模型的推理能力。
English Summary: This study introduces Share-GRPO, a reinforcement learning approach that enhances multimodal large language models' reasoning by expanding question spaces and sharing diverse reasoning trajectories and reward information to overcome sparse rewards and advantage vanishing issues.
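
One plausible reading of the hierarchical advantage estimation, sketched below: each rollout's advantage blends a within-variant baseline with a baseline shared across all variants of the same question. The dictionary format and the blending weight `alpha` are assumptions for illustration:

```python
import numpy as np

def hierarchical_advantages(rewards: dict[str, np.ndarray], alpha: float = 0.5):
    """rewards[v] holds rollout rewards for variant v of one question."""
    all_r = np.concatenate(list(rewards.values()))
    g_mean, g_std = all_r.mean(), all_r.std() + 1e-8
    out = {}
    for variant, r in rewards.items():
        within = (r - r.mean()) / (r.std() + 1e-8)  # relative to this variant
        across = (r - g_mean) / g_std               # relative to all variants
        out[variant] = alpha * within + (1 - alpha) * across
    return out

print(hierarchical_advantages({
    "original":  np.array([1.0, 0.0, 1.0]),
    "rephrased": np.array([0.0, 0.0, 1.0]),
}))
```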

Authors:Shinnosuke Ono, Issey Sukeda, Takuro Fujii, Kosei Buma, Shunsuke Sasaki
Title: A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP
Abstract:
We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at https://github.com/EQUES-Inc/pharma-LLM-eval.
中文摘要:本研究开发了针对日语医药领域的专业语言模型,通过双语医学语料持续预训练,在超越现有开源模型的同时与商业模型性能相当,并建立了三个专业基准测试体系进行全面评估。
English Summary: This study introduces a Japanese pharmaceutical domain-specific language model, developed through continual pretraining on bilingual medical corpora, which outperforms existing open models and shows competitive performance with commercial ones while establishing three specialized benchmarks for comprehensive evaluation.

Authors:Bowen Jiang, Runchuan Zhu, Jiang Wu, Zinco Jiang, Yifan He, Junyuan Gao, Jia Yu, Rui Min, Yinfan Wang, Haote Yang, Songyang Zhang, Dahua Lin, Lijun Wu, Conghui He
Title: Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering
Abstract:
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs' factual memory and self-awareness ("know what they don't know"). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs, including traditional LLMs and emerging Large Reasoning Models. Results show significant performance differences between the two domains, particularly in performance metrics, ranking, calibration, and robustness. This highlights the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at https://github.com/opendatalab/KoLasSimpleQA.
中文:KoLasSimpleQA是首个评估大语言模型多语言事实知识能力的基准,涵盖九种语言和双领域设计,全面评估模型能力并揭示通用领域与语言特定领域之间的性能差异。
English: KoLasSimpleQA is the first multilingual factual knowledge benchmark for Large Language Models, featuring nine languages and dual-domain design to comprehensively assess capabilities and reveal performance gaps between general and language-specific domains.

Authors:Ercong Nie, Helmut Schmid, Hinrich Schütze
Title: Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models
Abstract:
Language confusion -- where large language models (LLMs) generate unintended languages against the user's need -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion while largely preserving general competence and fluency. Our approach matches multilingual alignment in confusion reduction for many languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling. Code and data are available at: https://github.com/ercong21/lang_confusion.
中文摘要:本研究通过机制可解释性分析发现,大语言模型的语言混淆现象源于最终层的转换故障,并证明针对性神经元编辑能在保持模型性能的同时有效缓解该问题。
English Summary: This study uses mechanistic interpretability to identify that language confusion in LLMs stems from transition failures in final layers, demonstrating targeted neuron editing effectively mitigates the issue while maintaining model performance.

Authors:Yuliang Yan, Haochun Tang, Shuo Yan, Enyan Dai
Title: DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
Abstract:
Large language models (LLMs) are considered valuable Intellectual Properties (IP) for legitimate owners due to the enormous computational cost of training. It is crucial to protect the IP of LLMs from malicious stealing or unauthorized deployment. Despite existing efforts in watermarking and fingerprinting LLMs, these methods either impact the text generation process or require white-box access to the suspect model, making them impractical. Hence, we propose DuFFin, a novel Dual-Level Fingerprinting Framework for ownership verification in the black-box setting. DuFFin extracts trigger-pattern and knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on a variety of models collected from open-source websites, including four popular base models as protected LLMs and their fine-tuning, quantization, and safety alignment versions, which are released by large companies, start-ups, and individual users. Results show that our method can accurately verify the copyright of the base protected LLM on their model variants, achieving the IP-ROC metric greater than 0.95. Our code is available at https://github.com/yuliangyan0807/llm-fingerprint.
中文摘要:提出的DuFFin框架通过双重指纹识别技术,能在黑盒设置下精确验证大语言模型的所有权,在各类模型变体上实现了超过0.95的IP-ROC指标。
English Summary: The proposed DuFFin framework uses dual-level fingerprints to accurately verify the ownership of large language models in black-box settings, achieving high IP-ROC scores above 0.95 across various model variants.

Authors:Kuicai Dong, Yujing Chang, Shijie Huang, Yasheng Wang, Ruiming Tang, Yong Liu
Title: Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering
Abstract:
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents (text, images, tables) and performing cross-modal reasoning. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches, frequently missing critical visual information. The field also lacks robust benchmarks for assessing multimodal evidence selection and integration. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains. Our framework introduces innovative metrics for evaluating multimodal quote selection and enables answers that interleave text with relevant visual elements. Through large-scale experiments with 60 VLM/LLM models and 14 retrieval systems, we identify persistent challenges in multimodal evidence retrieval, selection, and integration. Key findings reveal that advanced proprietary LVMs show superior performance to open-sourced alternatives and gain moderate advantages from multimodal inputs over text-only inputs, while open-source alternatives show significant performance degradation. Notably, fine-tuned LLMs achieve substantial improvements when using detailed image descriptions. MMDocRAG establishes a rigorous testing ground and provides actionable insights for developing more robust multimodal DocVQA systems. Our benchmark and code are available at https://mmdocrag.github.io/MMDocRAG/.
Chinese: MMDocRAG提出了包含4,055个专家标注问答对的综合基准及创新评估指标,通过大规模实验发现专有视觉语言模型在跨模态证据处理上显著优于开源模型,为多模态文档问答系统提供了重要改进方向。
English: MMDocRAG introduces a comprehensive benchmark with 4,055 QA pairs and novel metrics to address DocVQA's limitations in multimodal evidence handling, revealing through extensive testing that proprietary models outperform open-source alternatives and benefit more from visual inputs.

Authors:Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Title: Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments, context attribution, remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs in different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models. Our code is available at https://github.com/ruizheliUOA/ARC_JSD
中文摘要:本文提出ARC-JSD,一种基于Jensen-Shannon散度的新方法,无需微调即可高效地将检索增强生成系统的回复归因于特定上下文片段,并在多个基准上展现出更高的准确性和计算效率。
English Summary: This paper introduces ARC-JSD, a novel method that uses Jensen-Shannon Divergence to efficiently attribute generated responses to specific context segments in Retrieval-Augmented Generation systems, eliminating the need for fine-tuning while demonstrating superior accuracy and computational efficiency across multiple benchmarks.
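
A minimal sketch of the core divergence computation, assuming `p_full` is the model's next-token distribution given the full context and each entry of `p_without` is the distribution with one candidate context sentence removed (the paper's actual pipeline aggregates such scores over the whole generated response):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def arc_jsd_scores(p_full: np.ndarray, p_without: list[np.ndarray]) -> list[float]:
    """Score each sentence by how much its removal shifts the output
    distribution; scipy returns the JS distance, so square it to get JSD."""
    return [float(jensenshannon(p_full, q) ** 2) for q in p_without]

# Toy 4-token vocabulary: removing sentence 2 perturbs the output far more,
# so it is attributed as the essential context.
p_full = np.array([0.70, 0.15, 0.10, 0.05])
p_without = [np.array([0.68, 0.16, 0.11, 0.05]),
             np.array([0.25, 0.40, 0.20, 0.15])]
print(arc_jsd_scores(p_full, p_without))
```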

Authors:Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen
Title: Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning
Abstract:
Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.
中文:Tool-Star是一个基于强化学习的框架,通过两阶段训练和分层奖励设计,使大语言模型能够在推理过程中自主调用多种外部工具,在多项基准测试中展现出卓越性能。
English: Tool-Star is a reinforcement learning framework that enables large language models to autonomously use multiple external tools during reasoning through a two-stage training process and hierarchical reward design, demonstrating superior performance across various benchmarks.

Authors:Muhammad Farid Adilazuarda, Chen Cecilia Liu, Iryna Gurevych, Alham Fikri Aji
Title: From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs
Abstract:
Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and limited training data. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for various downstream tasks. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To investigate these issues, we augment WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. While these narratives may have variable effects on downstream tasks, they consistently improve cultural distinctiveness over survey data alone. Our work highlights the inherent complexity of aligning cultural values with the goal of guiding task-specific behavior. We release our code at https://github.com/faridlazuarda/from-surveys-to-narratives.
中文摘要:本研究发现仅依赖世界价值观调查数据进行大语言模型的文化适应会简化文化规范并损害事实准确性,但通过维基百科和NormAd的文化叙事进行补充后,尽管对下游任务影响不一,却显著提升了文化独特性。
English Summary: This study reveals that relying solely on World Values Survey data for cultural adaptation in LLMs can oversimplify cultural norms and impair factual accuracy, but augmenting with cultural narratives from Wikipedia and NormAd enhances cultural distinctiveness despite variable task impacts.

Authors:Pierre Achkar, Tim Gollub, Martin Potthast
Title: Ask, Retrieve, Summarize: A Modular Pipeline for Scientific Literature Summarization
Abstract:
The exponential growth of scientific publications has made it increasingly difficult for researchers to stay updated and synthesize knowledge effectively. This paper presents XSum, a modular pipeline for multi-document summarization (MDS) in the scientific domain using Retrieval-Augmented Generation (RAG). The pipeline includes two core components: a question-generation module and an editor module. The question-generation module dynamically generates questions adapted to the input papers, ensuring the retrieval of relevant and accurate information. The editor module synthesizes the retrieved content into coherent and well-structured summaries that adhere to academic standards for proper citation. Evaluated on the SurveySum dataset, XSum demonstrates strong performance, achieving considerable improvements in metrics such as CheckEval, G-Eval and Ref-F1 compared to existing approaches. This work provides a transparent, adaptable framework for scientific summarization with potential applications in a wide range of domains. Code available at https://github.com/webis-de/scolia25-xsum
中文:本文提出XSum,一个基于检索增强生成的科学文献多文档摘要模块化流程,通过动态生成问题并整合检索信息形成连贯摘要,在基准评估中表现出优越性能。
English: This paper introduces XSum, a modular pipeline for scientific multi-document summarization using Retrieval-Augmented Generation, which dynamically generates questions and synthesizes retrieved information into coherent summaries, showing strong performance on benchmark evaluations.

Authors:Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo
Title: Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance
Abstract:
Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process design that enables quantifying the impact of memory utilization on task performance. This process enables the evaluation of agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: https://connoriginal.github.io/MEMENTO
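中文: MEMENTO是一个个性化具身智能体评估框架,通过两阶段记忆评估流程量化记忆利用对任务性能的影响,实验表明即使是GPT-4o等前沿模型在需要引用多条记忆时性能也会下降30.5%。
English: MEMENTO is a personalized embodied agent evaluation framework that quantifies how memory utilization affects task performance through a two-stage evaluation process, revealing that even frontier models like GPT-4o suffer a 30.5% performance drop when multiple memories must be referenced.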

Authors:Wenqing Wu, Chengzhi Zhang, Tong Bao, Yi Zhao
Title: SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers
Abstract:
Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground-truth labels to obtain prediction results. The results indicate that using the introduction, results and discussion is most appropriate for assessing the novelty of a paper, while using the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at https://github.com/njust-winchy/SC4ANM.
中文摘要:本研究通过语言模型探索学术论文中预测新颖性评分的最佳章节组合,发现引言、结果和讨论部分评估效果最佳,而全文分析效果不显著。
English Summary: This study explores optimal section combinations in academic papers for predicting novelty scores using language models, finding that introduction, results, and discussion sections yield the most effective assessment while full-text analysis proves less impactful.
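
To make the search concrete, here is a minimal sketch of the section-combination sweep described above; predict_novelty is a dummy stand-in for the actual PLM/LLM call, and ranking by absolute error against the expert label is an illustrative criterion.

```python
from itertools import combinations

SECTIONS = ["introduction", "methods", "results", "discussion"]  # IMRaD

def predict_novelty(text: str) -> float:
    """Stand-in for a PLM/LLM call that returns a novelty score (dummy heuristic)."""
    return min(10.0, len(set(text.split())) / 100)

def best_section_combination(paper: dict, gold_score: float):
    """Try every non-empty section combination and rank by error vs. the expert label."""
    errors = {}
    for r in range(1, len(SECTIONS) + 1):
        for combo in combinations(SECTIONS, r):
            text = "\n\n".join(paper[s] for s in combo)
            errors[combo] = abs(predict_novelty(text) - gold_score)
    return min(errors, key=errors.get)
```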

Authors:Jiawei Liu, Qisi Chen, Jianshu Zhang, Quan Liu, Defu Lian
Title: EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning
Abstract:
Large Language Models (LLMs) excel at complex reasoning through search algorithms, yet current strategies often suffer from massive token consumption due to redundant exploration of semantically equivalent steps. Existing semantic similarity methods struggle to accurately identify such equivalence in domain-specific contexts like mathematical reasoning. To address this, we propose EquivPruner, a simple yet effective approach that identifies and prunes semantically equivalent actions during LLM reasoning search. We also introduce MathEquiv, the first dataset we created for mathematical statement equivalence, which enables the training of a lightweight equivalence detector. Extensive experiments across various models and tasks demonstrate that EquivPruner significantly reduces token consumption, improving searching efficiency and often bolstering reasoning accuracy. For instance, when applied to Qwen2.5-Math-7B-Instruct on GSM8K, EquivPruner reduced token consumption by 48.1\% while also improving accuracy. Our code is available at https://github.com/Lolo1222/EquivPruner.
Chinese Summary: EquivPruner方法通过剪枝语义等价步骤,显著降低了大型语言模型推理中的令牌消耗,在GSM8K任务上实现了48.1%的令牌削减并提升了准确性。
English Summary: The EquivPruner method effectively reduces token consumption in LLM reasoning by pruning semantically equivalent steps, enhancing both efficiency and accuracy, as demonstrated by a 48.1% token reduction on GSM8K.
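
A minimal sketch of the pruning step, assuming a pairwise detector like the one trained on MathEquiv; the stub below just normalizes whitespace and case, whereas the paper uses a learned lightweight classifier.

```python
def is_equivalent(step_a: str, step_b: str) -> bool:
    """Stub detector: treat steps as equivalent if they normalize identically."""
    norm = lambda s: "".join(s.lower().split())
    return norm(step_a) == norm(step_b)

def prune_equivalent(candidate_steps: list) -> list:
    """Keep one representative per equivalence class of candidate search actions."""
    kept = []
    for step in candidate_steps:
        if not any(is_equivalent(step, k) for k in kept):
            kept.append(step)
    return kept

# The search now expands 2 branches instead of 3:
print(prune_equivalent(["x = 4/2", "x=4 / 2", "x = 2"]))  # ['x = 4/2', 'x = 2']
```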

Authors:Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong
Title: HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation
Abstract:
The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model's self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at https://github.com/nlp2ct-shijie/HiMATE.
中文摘要:HiMATE框架基于MQM错误类型构建分层多智能体系统,通过自我反思和智能体间非对称信息讨论,显著提升了机器翻译评估中错误定位与严重性判定的准确性。
English Summary: The HiMATE framework leverages a hierarchical multi-agent system based on MQM error typology to enhance machine translation evaluation, significantly improving error span detection and severity assessment through self-reflection and agent discussions.

Authors:Sampanna Yashwant Kahu, Naman Ahuja
Title: All You Need is "Leet": Evading Hate-speech Detection AI
Abstract:
Social media and online forums are increasingly popular. Unfortunately, these platforms are being used for spreading hate speech. In this paper, we design black-box techniques to protect users from hate speech on online platforms by generating perturbations that can fool state-of-the-art deep-learning-based hate speech detection models, thereby decreasing their efficiency. We also ensure a minimal change in the original meaning of the hate speech. Our best perturbation attack successfully evades hate-speech detection for 86.8% of hateful text.
Chinese: 本文设计了黑盒技术,通过生成扰动来规避最先进的仇恨言论检测模型,在保持原意基本不变的同时,成功使86.8%的仇恨文本逃过检测。
English: This paper develops black-box perturbation techniques that evade state-of-the-art hate speech detection models, successfully slipping 86.8% of hateful text past detection while minimally altering its original meaning.
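
For illustration only, here is a toy fixed-map "leet" substitution of the kind such attacks build on; the paper's actual perturbations are black-box and optimized against the target detectors rather than a static map.

```python
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}

def leetify(text: str) -> str:
    """Apply the substitution map; unmapped characters pass through unchanged."""
    return "".join(LEET_MAP.get(ch.lower(), ch) for ch in text)

print(leetify("this is a test"))  # 7h15 15 4 7357
```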

Authors:Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Title: IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection
Abstract:
Interpreting figurative language such as sarcasm across multi-modal inputs presents unique challenges, often requiring task-specific fine-tuning and extensive reasoning steps. However, current Chain-of-Thought approaches do not efficiently leverage the same cognitive processes that enable humans to identify sarcasm. We present IRONIC, an in-context learning framework that leverages Multi-modal Coherence Relations to analyze referential, analogical and pragmatic image-text linkages. Our experiments show that IRONIC achieves state-of-the-art performance on zero-shot Multi-modal Sarcasm Detection across different baselines. This demonstrates the need for incorporating linguistic and cognitive insights into the design of multi-modal reasoning strategies. Our code is available at: https://github.com/aashish2000/IRONIC
中文摘要:IRONIC是一种新颖的情境学习框架,通过利用多模态连贯关系实现了零样本多模态讽刺检测的最优性能,证明了将语言学和认知原理融入多模态推理设计的重要性。
English Summary: IRONIC is a novel in-context learning framework that uses multi-modal coherence relations to achieve state-of-the-art zero-shot sarcasm detection, demonstrating the importance of integrating linguistic and cognitive principles into multi-modal reasoning.

Authors:Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su
Title: LIFEBench: Evaluating Length Instruction Following in Large Language Models
Abstract:
While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., "write a 10,000-word novel." Models often generate outputs that are far too short, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce the Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length-instruction following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length-instruction-following ability, offering critical insights for future progress.
Chinese: 尽管大语言模型能解决博士级别的复杂推理问题,却常无法遵循明确的篇幅指令,为此推出的LIFEBench通过多任务评估揭示了它们在各类字数要求下的根本缺陷。
English: Despite excelling at complex reasoning tasks, large language models frequently fail to adhere to explicit length instructions, prompting the creation of LIFEBench to evaluate and reveal their limitations across diverse tasks and word counts.
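
A minimal sketch of a length-compliance score that such a benchmark needs; the deviation ratio below is an illustrative choice, not necessarily LIFEBench's exact metric.

```python
def length_score(output: str, target_words: int) -> float:
    """1.0 when the word count hits the target, decaying toward 0 with deviation."""
    n = len(output.split())
    return max(0.0, 1.0 - abs(n - target_words) / target_words)

print(length_score("word " * 90, 100))  # 0.9: output is 10% under target
```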

Authors:Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
Title: AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Abstract:
The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust-the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
Chinese: 本文提出AudioTrust框架,通过涵盖六个关键维度的系统性评估方法,针对音频大语言模型的信任度进行测试,发现14种先进模型在面对4,420多个真实场景音频样本时存在显著缺陷。
English: This paper introduces AudioTrust, a comprehensive framework designed to systematically evaluate the trustworthiness of Audio Large Language Models (ALLMs) by addressing audio-specific risks across six key dimensions, revealing significant vulnerabilities in 14 state-of-the-art models when tested with over 4,420 real-world audio samples.

Authors:Yuqing Yang, Robin Jia
Title: When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
Abstract:
Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available on https://github.com/ayyyq/llm-retraction.
中文: 本研究探讨大型语言模型何时及为何撤回错误答案,发现撤回行为罕见且与模型对事实正确性的内部信念存在因果关联,监督微调可通过优化这些信念显著提升撤回表现。
English: This study investigates when and why large language models (LLMs) retract incorrect answers, finding that retraction is rare and causally linked to the model's internal belief about factual correctness, with supervised fine-tuning shown to improve performance by refining these beliefs.

Authors:Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Yang Gao, Heyan Huang
Title: EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios
Abstract:
As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at https://github.com/ybai-nlp/EduBench.
中文: 本文推出首个教育场景多样化基准EduBench,涵盖9大场景的合成数据和多维评估指标,通过人工标注验证有效性,并成功训练出性能媲美顶尖大模型的小型教育语言模型。
English: This paper introduces EduBench, the first diverse benchmark for educational language models, featuring synthetic data across 9 scenarios and multi-dimensional metrics validated through human annotation, with a trained small model matching top large models' performance.

Authors:Zhenglin Hua, Jinghan He, Zijun Yao, Tianxu Han, Haiyun Guo, Yuheng Jia, Junfeng Fang
Title: Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
Abstract:
Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with faithfulness or hallucination, extracting more precise and disentangled hallucination-related representations. Our analysis demonstrates that interventions along the identified faithful direction can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a plug-and-play method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead. The code is available at https://github.com/huazhenglin2003/SSL.
中文摘要:本文提出SSL方法,通过稀疏自编码器识别并调控大视觉语言模型中的潜在语义方向,在可忽略的额外时间成本下有效缓解幻觉现象,同时保持语义完整性和跨模型架构的迁移能力。
English Summary: This paper introduces SSL, a plug-and-play method using sparse autoencoders to identify and steer latent directions in large vision-language models, effectively reducing hallucinations while maintaining semantic integrity and transferability across architectures with minimal computational overhead.
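
A minimal sketch of inference-time steering along an SAE-derived direction, assuming a unit-norm "faithful" direction has already been extracted for one layer; the hook mechanics, the coefficient alpha, and the layer index in the usage comment are illustrative choices, not SSL's exact recipe.

```python
import torch

def make_steering_hook(direction: torch.Tensor, alpha: float = 4.0):
    """Return a forward hook that shifts hidden states along a fixed direction."""
    direction = direction / direction.norm()
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)  # push toward faithfulness
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage on one decoder layer of an LVLM:
# handle = model.language_model.layers[20].register_forward_hook(
#     make_steering_hook(faithful_direction))
```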

Authors:Hyang Cui
Title: LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods
Abstract:
Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
Chinese: 本文提出了一种基于生成的机器翻译质量评估方法,利用仅解码器大语言模型生成高质量参考译文,并通过句子嵌入进行语义相似度评分,其表现优于现有评分基准和外部无参考指标。
English: This paper introduces a generation-based evaluation method for machine translation quality estimation that uses decoder-only large language models to create references and assesses semantic similarity with sentence embeddings, outperforming existing scoring baselines and metrics.
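
A minimal sketch of the generate-then-score paradigm: any decoder-only LLM (wrapped here as a generate_reference function) produces a reference, and the hypothesis is scored by embedding cosine similarity; the embedding model named below is an illustrative choice, not necessarily the paper's.

```python
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def quality_estimate(source: str, hypothesis: str, generate_reference) -> float:
    """generate_reference(source) is any decoder-only LLM wrapped as a function."""
    reference = generate_reference(source)
    emb = embedder.encode([hypothesis, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()  # higher = closer to the LLM reference
```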

Authors:Yash Kumar Atri, Thomas H Shin, Thomas Hartvigsen
Title: Continually Self-Improving Language Models for Bariatric Surgery Question-Answering
Abstract:
While bariatric and metabolic surgery (MBS) is considered the gold standard treatment for severe and morbid obesity, its therapeutic efficacy hinges upon active and longitudinal engagement with multidisciplinary providers, including surgeons, dietitians/nutritionists, psychologists, and endocrinologists. This engagement spans the entire patient journey, from preoperative preparation to long-term postoperative management. However, this process is often hindered by numerous healthcare disparities, such as logistical and access barriers, which impair easy patient access to timely, evidence-based, clinician-endorsed information. To address these gaps, we introduce bRAGgen, a novel adaptive retrieval-augmented generation (RAG)-based model that autonomously integrates real-time medical evidence when response confidence dips below dynamic thresholds. This self-updating architecture ensures that responses remain current and accurate, reducing the risk of misinformation. Additionally, we present bRAGq, a curated dataset of 1,302 bariatric surgery-related questions, validated by an expert bariatric surgeon. bRAGq constitutes the first large-scale, domain-specific benchmark for comprehensive MBS care. In a two-phase evaluation, bRAGgen is benchmarked against state-of-the-art models using both large language model (LLM)-based metrics and expert surgeon review. Across all evaluation dimensions, bRAGgen demonstrates substantially superior performance in generating clinically accurate and relevant responses.
中文: 减重与代谢手术需多学科全程协作,但医疗差异常阻碍患者获取可靠信息,为此开发的bRAGgen自适应AI模型能动态整合实时医学证据,确保回答准确及时,经专家验证表现显著优于现有模型。
English: Bariatric and metabolic surgery requires continuous multidisciplinary care, but healthcare disparities often limit access to reliable information, prompting the development of bRAGgen, an adaptive AI model that integrates real-time medical evidence to provide accurate, up-to-date responses, validated as superior through expert evaluation.
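
A minimal sketch of the confidence-gated retrieval loop described above; all helper interfaces (llm, retriever) are hypothetical stand-ins, and the static threshold replaces the paper's dynamic one.

```python
def answer_with_adaptive_rag(question: str, llm, retriever, threshold: float = 0.7) -> str:
    """Answer from the base model; pull in fresh evidence only when confidence is low."""
    draft, confidence = llm.answer_with_confidence(question)
    if confidence >= threshold:
        return draft
    evidence = retriever.search(question, k=5)  # e.g., real-time medical sources
    return llm.answer_with_context(question, evidence)
```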

Authors:Ziqing Wang, Kexin Zhang, Zihan Zhao, Yibo Wen, Abhishek Pandey, Han Liu, Kaize Ding
Title: A Survey of Large Language Models for Text-Guided Molecular Discovery: from Molecule Generation to Optimization
Abstract:
Large language models (LLMs) are introducing a paradigm shift in molecular discovery by enabling text-guided interaction with chemical spaces through natural language and symbolic notations, with emerging extensions that incorporate multi-modal inputs. To advance the new field of LLMs for molecular discovery, this survey provides an up-to-date and forward-looking review of the emerging use of LLMs for two central tasks: molecule generation and molecule optimization. Based on our proposed taxonomy for both problems, we analyze representative techniques in each category, highlighting how LLM capabilities are leveraged across different learning settings. In addition, we include the commonly used datasets and evaluation protocols. We conclude by discussing key challenges and future directions, positioning this survey as a resource for researchers working at the intersection of LLMs and molecular science. A continuously updated reading list is available at https://github.com/REAL-Lab-NU/Awesome-LLM-Centric-Molecular-Discovery.
Chinese: 大语言模型通过文本引导探索化学空间,并支持分子生成与优化等核心任务,正在彻底改变分子发现领域,本综述对此进行了系统梳理与前瞻展望。
English: Large language models are revolutionizing molecular discovery by enabling text-guided exploration of chemical spaces and supporting key tasks like molecule generation and optimization, as detailed in this comprehensive survey.

Authors:Gagan Bhatia, Maxime Peyrard, Wei Zhao
Title: Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning
Abstract:
Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed the date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future time periods; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates such as historical and futuristic ones. Further, we find that larger models accomplish the emergent date abstraction that heals date fragments more quickly. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, which typically differs from human interpretation (year → month → day). Our datasets and code are made publicly available at https://github.com/gagan3012/date-fragments.
中文: 现代BPE分词器将日期分割成无意义片段,影响时间推理,本研究提出了日期碎片化度量标准、测试基准,并揭示了大语言模型通过涌现的抽象机制重组这些片段的过程。
English: Modern BPE tokenizers fragment calendar dates into meaningless pieces, impairing temporal reasoning, but this work introduces a date fragmentation metric, a benchmark for testing, and reveals how large language models reassemble these fragments through an emergent abstraction mechanism.
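
One plausible formalization of the date fragmentation ratio, treating it as the fraction of produced tokens that fail to line up with a true date component; the paper's exact definition may differ, so treat this as illustrative.

```python
def date_fragmentation_ratio(tokens: list, components: set) -> float:
    """0.0 when every token is a whole date component; grows as the date shatters."""
    aligned = sum(tok in components for tok in tokens)
    return 1.0 - aligned / len(tokens)

# A tokenizer that splits 20250312 into 202|503|12 preserves only "12":
print(date_fragmentation_ratio(["202", "503", "12"], {"2025", "03", "12"}))  # ≈ 0.67
# A component-faithful split scores 0.0:
print(date_fragmentation_ratio(["2025", "03", "12"], {"2025", "03", "12"}))  # 0.0
```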

Authors:Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, Wenhu Chen
Title: Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning
Abstract:
Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.
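Chinese: 本文提出像素空间推理概念,为视觉语言模型配备放大、选帧等视觉操作,并通过指令微调与好奇心驱动的强化学习两阶段训练,使7B模型在多个视觉推理基准上取得开源模型最高准确率。
English: This work introduces pixel-space reasoning, equipping vision-language models with visual operations such as zoom-in and select-frame and training them in two phases (instruction tuning, then curiosity-driven reinforcement learning), with the resulting 7B model achieving the highest open-source accuracy on several visual reasoning benchmarks.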

Authors:Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun
Title: Pre-training Large Memory Language Models with Internal and External Knowledge
Abstract:
Neural language models are black-boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.
Chinese: 大记忆语言模型(LMLM)在预训练期间将事实知识外部化存储到数据库,既实现了与更大规模模型相媲美的性能,又提供了可编辑和可验证的知识库。
English: Large Memory Language Models (LMLMs) externalize factual knowledge to an external database during pre-training, enabling competitive performance with significantly larger models while providing editable and verifiable knowledge bases.
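
A minimal sketch of the loss-masking idea in PyTorch: positions holding externally retrieved factual values are excluded from the next-token loss via ignore_index, so the model learns to look facts up rather than memorize them. The masking mechanics are standard; the surrounding training recipe is the paper's.

```python
import torch
import torch.nn.functional as F

def lm_loss_with_fact_masking(logits, labels, fact_mask):
    """fact_mask: bool tensor, True where the token is a retrieved factual value."""
    labels = labels.clone()
    labels[fact_mask] = -100          # drop these positions from the loss
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100
    )
```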

Authors:Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
Title: Training Step-Level Reasoning Verifiers with Formal Verification Tools
Abstract:
Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proving, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem-proving tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.
中文摘要:本文提出FoVer方法,利用形式化验证工具自动生成过程奖励模型的步骤级训练数据,无需人工标注即可实现跨任务泛化,并在多个推理基准测试中达到领先性能。
English Summary: This paper introduces FoVer, an automated method using formal verification tools to generate step-level training data for Process Reward Models (PRMs), enabling cross-task generalization and achieving state-of-the-art performance across diverse reasoning benchmarks without human annotation.
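
A minimal sketch of step verification with Z3 (the z3-solver Python package): a claimed step is labeled correct if its negation is unsatisfiable given the premises. FoVer's pipelines over Z3 and Isabelle are far richer; this only illustrates the underlying check.

```python
from z3 import Int, Solver, Not, unsat

x = Int("x")
premise = x + 2 == 5
claimed_step = x == 3

s = Solver()
s.add(premise, Not(claimed_step))  # look for a counterexample to the step
label = "correct" if s.check() == unsat else "incorrect"
print(label)  # correct: no model satisfies the premise while violating the step
```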

Authors:Chih-Kai Yang, Neo S. Ho, Hung-yi Lee
Title: Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
Abstract:
With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.
Chinese: 本研究针对大型音频语言模型评估标准零散的问题,首次提出系统化分类框架,涵盖听觉处理、知识推理、对话能力和伦理安全四大维度,为领域发展提供首个全面评估指南与资源库。
English: This survey introduces a systematic taxonomy for evaluating large audio-language models across four dimensions—auditory processing, knowledge reasoning, dialogue ability, and ethical safety—addressing fragmented benchmarks and providing the first comprehensive evaluation framework for the field.

Authors:Tony Montes, Fernando Lozano
Title: ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation
Abstract:
Recent advancements in Video Question Answering (VideoQA) have introduced LLM-based agents, modular frameworks, and procedural solutions, yielding promising results. These systems use dynamic agents and memory-based mechanisms to break down complex tasks and refine answers. However, significant room for improvement remains in grounding objects over time and in reasoning-based decision-making that aligns object references with language model outputs, even as newer models improve at both tasks. This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA) that combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks. Our framework also enables cross-checking of grounding timeframes, improving accuracy and providing valuable support for verification and increased output reliability across multiple video domains. The code is available at https://github.com/t-montes/viqagent.
中文摘要:本研究提出了一种基于大语言模型的零样本视频问答智能体,通过结合思维链推理与YOLO-World目标追踪技术,在多个基准测试中实现最优性能,同时支持时间定位交叉验证以提升输出可靠性。
English Summary: This work introduces an LLM-brained agent for zero-shot VideoQA that integrates Chain-of-Thought reasoning with YOLO-World object tracking, achieving state-of-the-art performance across multiple benchmarks while enabling cross-verification of grounding timeframes for enhanced reliability.

Authors:Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, Jun Xu
Title: GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents
Abstract:
Recent Graphical User Interface (GUI) agents replicate the R1-Zero paradigm, coupling online Reinforcement Learning (RL) with explicit chain-of-thought reasoning prior to object grounding and thereby achieving substantial performance gains. In this paper, we first conduct extensive analysis experiments of three key components of that training pipeline: input design, output evaluation, and policy update, each revealing distinct challenges arising from blindly applying general-purpose RL without adapting to GUI grounding tasks. Input design: Current templates encourage the model to generate chain-of-thought reasoning, but longer chains unexpectedly lead to worse grounding performance. Output evaluation: Reward functions based on hit signals or box area allow models to exploit box size, leading to reward hacking and poor localization quality. Policy update: Online RL tends to overfit easy examples due to biases in length and sample difficulty, leading to under-optimization on harder cases. To address these issues, we propose three targeted solutions. First, we adopt a Fast Thinking Template that encourages direct answer generation, reducing excessive reasoning during training. Second, we incorporate a box size constraint into the reward function to mitigate reward hacking. Third, we revise the RL objective by adjusting length normalization and adding a difficulty-aware scaling factor, enabling better optimization on hard samples. Our GUI-G1-3B, trained on 17K public samples with Qwen2.5-VL-3B-Instruct, achieves 90.3% accuracy on ScreenSpot and 37.1% on ScreenSpot-Pro. This surpasses all prior models of similar size and even outperforms the larger UI-TARS-7B, establishing a new state-of-the-art in GUI agent grounding. The project repository is available at https://github.com/Yuqi-Zhou/GUI-G1.
中文摘要:本文分析了图形用户界面智能体训练中的三大挑战,并提出快速思考模板、边界框约束和优化强化学习目标三项针对性解决方案,使GUI-G1-3B模型在界面定位任务中达到最新最优性能。
English Summary: This paper analyzes challenges in GUI agent training and proposes three targeted solutions—a Fast Thinking Template, box size constraints, and a revised RL objective—enabling their GUI-G1-3B model to achieve state-of-the-art grounding performance.
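
A minimal sketch of a hit reward with a box-size constraint, illustrating the anti-reward-hacking idea above; the area bound is a placeholder, not the paper's value, and coordinates are assumed normalized to [0, 1].

```python
def grounding_reward(pred_box, target_point, max_area_ratio: float = 0.05,
                     screen_area: float = 1.0) -> float:
    """Reward a hit only when the predicted box is not degenerately large."""
    x1, y1, x2, y2 = pred_box
    hit = x1 <= target_point[0] <= x2 and y1 <= target_point[1] <= y2
    area_ok = (x2 - x1) * (y2 - y1) / screen_area <= max_area_ratio
    return 1.0 if (hit and area_ok) else 0.0  # oversized boxes earn nothing

print(grounding_reward((0.0, 0.0, 1.0, 1.0), (0.5, 0.5)))      # 0.0: whole-screen box
print(grounding_reward((0.45, 0.45, 0.55, 0.55), (0.5, 0.5)))  # 1.0: tight hit
```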

Authors:Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang
Title: VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Abstract:
Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
Chinese: 本文提出了VerifyBench和VerifyBench-Hard两个基准测试,专门用于评估推理模型中基于参考的奖励系统,填补了现有验证器评估的空白,并揭示了特别是较小模型在验证器准确性方面仍需显著提升的空间。
English: This paper introduces VerifyBench and VerifyBench-Hard, two benchmarks designed to evaluate reference-based reward systems for reinforcement learning in reasoning models, addressing current gaps in verifier assessment and revealing significant improvement opportunities especially for smaller models.

Authors:Danna Zheng, Mirella Lapata, Jeff Z. Pan
Title: Long-Form Information Alignment Evaluation Beyond Atomic Facts
Abstract:
Information alignment evaluators are vital for various NLG evaluation tasks and trustworthy LLM deployment, reducing hallucinations and enhancing user trust. Current fine-grained methods, like FactScore, verify facts individually but neglect inter-fact dependencies, enabling subtle vulnerabilities. In this work, we introduce MontageLie, a challenging benchmark that constructs deceptive narratives by "montaging" truthful statements without introducing explicit hallucinations. We demonstrate that both coarse-grained LLM-based evaluators and current fine-grained frameworks are susceptible to this attack, with AUC-ROC scores falling below 65%. To enable more robust fine-grained evaluation, we propose DoveScore, a novel framework that jointly verifies factual accuracy and event-order consistency. By modeling inter-fact relationships, DoveScore outperforms existing fine-grained methods by over 8%, providing a more robust solution for long-form text alignment evaluation. Our code and datasets are available at https://github.com/dannalily/DoveScore.
中文: 本文提出MontageLie基准测试,通过组合真实陈述构建欺骗性叙述来揭示现有事实核查方法的脆弱性,并推出DoveScore框架,通过联合验证事实准确性和事件顺序一致性,显著提升了评估鲁棒性。
English: This paper introduces MontageLie, a benchmark revealing vulnerabilities in current fact-checking methods by creating deceptive narratives from truthful statements, and proposes DoveScore, a robust framework that improves evaluation accuracy by verifying both factual accuracy and event-order consistency.
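
A minimal sketch in the spirit of DoveScore's joint check: factual support and event-order consistency are verified together, so a "montaged" narrative can fail on order even when every atomic fact is true. Both checkers here are hypothetical stand-ins, not the paper's components.

```python
def order_consistent(candidate_events: list, source_events: list) -> bool:
    """True if the candidate mentions events in the same relative order as the source."""
    positions = [source_events.index(e) for e in candidate_events]
    return all(a <= b for a, b in zip(positions, positions[1:]))

def joint_alignment_score(facts: list, fact_supported,
                          candidate_events: list, source_events: list) -> float:
    fact_score = sum(map(fact_supported, facts)) / len(facts)
    order_score = 1.0 if order_consistent(candidate_events, source_events) else 0.0
    return min(fact_score, order_score)  # true facts in a false order still fail
```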

Authors:Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
Title: dKV-Cache: The Cache for Diffusion Language Models
Abstract:
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which has aggressive caching with reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. Overall, dKV-Cache achieves a 2-10x inference speedup, largely narrowing the gap between autoregressive models and DLMs. We evaluate our dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that caching can also be used in DLMs, even in a training-free manner with current DLMs.
中文摘要:扩散语言模型通过提出延迟KV缓存机制,实现了2-10倍的推理加速,在保持甚至提升多项语言任务性能的同时,显著缩小了与自回归模型的速度差距。
English Summary: Diffusion Language Models (DLMs) have overcome their slow inference limitation through a novel delayed KV-Cache mechanism that achieves 2-10x speedup while maintaining or even enhancing performance on various language tasks.
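
A minimal sketch of the "delayed" idea: a position's key/value states are frozen only once the token there has been decoded, since masked positions keep changing representation across denoising steps. Interfaces are illustrative, not the paper's implementation.

```python
class DelayedKVCache:
    """Cache K/V per position, but only after that position has stabilized."""

    def __init__(self):
        self.cache = {}  # position -> (key, value)

    def update(self, position, key, value, is_decoded: bool):
        if is_decoded and position not in self.cache:
            self.cache[position] = (key, value)  # freeze once decoded

    def get(self, position):
        return self.cache.get(position)  # None means: recompute this position
```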

Authors:Zhen Zhang, Xuehai He, Weixiang Yan, Ao Shen, Chenyang Zhao, Shuohang Wang, Yelong Shen, Xin Eric Wang
Title: Soft Thinking: Unlocking the Reasoning Potential of LLMs in Continuous Concept Space
Abstract:
Human cognition typically involves thinking through abstract, fluid concepts rather than strictly using discrete linguistic tokens. Current reasoning models, however, are constrained to reasoning within the boundaries of human language, processing discrete token embeddings that represent fixed points in the semantic space. This discrete constraint restricts the expressive power and upper potential of such reasoning models, often causing incomplete exploration of reasoning paths, as standard Chain-of-Thought (CoT) methods rely on sampling one token per step. In this work, we introduce Soft Thinking, a training-free method that emulates human-like "soft" reasoning by generating soft, abstract concept tokens in a continuous concept space. These concept tokens are created by the probability-weighted mixture of token embeddings, which form the continuous concept space, enabling smooth transitions and richer representations that transcend traditional discrete boundaries. In essence, each generated concept token encapsulates multiple meanings from related discrete tokens, implicitly exploring various reasoning paths to converge effectively toward the correct answer. Empirical evaluations on diverse mathematical and coding benchmarks consistently demonstrate the effectiveness and efficiency of Soft Thinking, improving pass@1 accuracy by up to 2.48 points while simultaneously reducing token usage by up to 22.4% compared to standard CoT. Qualitative analysis further reveals that Soft Thinking outputs remain highly interpretable and readable, highlighting the potential of Soft Thinking to break the inherent bottleneck of discrete language-based reasoning. Code is available at https://github.com/eric-ai-lab/Soft-Thinking.
中文: Soft Thinking是一种无需训练的方法,通过在连续概念空间中生成抽象概念标记来模拟人类推理,提高准确性和效率,同时保持可解释性。
English: Soft Thinking is a training-free method that enhances reasoning by generating abstract concept tokens in a continuous space, improving accuracy and efficiency while maintaining interpretability.
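
A minimal sketch of a concept token as the abstract describes it: rather than sampling one token id, the next input embedding is the probability-weighted mixture of all token embeddings. The temperature knob is an illustrative choice.

```python
import torch

def concept_token(logits: torch.Tensor, embedding: torch.nn.Embedding,
                  temperature: float = 1.0) -> torch.Tensor:
    """Map a (vocab,) logit vector to a (hidden,) point in continuous concept space."""
    probs = torch.softmax(logits / temperature, dim=-1)  # (vocab,)
    return probs @ embedding.weight                      # weighted mix of embeddings
```

The returned vector lives in the continuous space spanned by the embedding rows and replaces the usual discrete-token embedding lookup at the next step.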

Authors:Peng Wang, Biyu Zhou, Xuehai Tang, Jizhong Han, Songlin Hu
Title: LyapLock: Bounded Knowledge Preservation in Sequential Large Language Model Editing
Abstract:
Large Language Models often contain factually incorrect or outdated knowledge, giving rise to model editing methods for precise knowledge updates. However, current mainstream locate-then-edit approaches exhibit a progressive performance decline during sequential editing, due to inadequate mechanisms for long-term knowledge preservation. To tackle this, we model sequential editing as a constrained stochastic program. Given the challenges posed by the cumulative preservation-error constraint and the gradually revealed editing tasks, we propose LyapLock. It integrates queuing theory and Lyapunov optimization to decompose the long-term constrained program into tractable stepwise subproblems for efficient solving. This is the first model editing framework with rigorous theoretical guarantees, achieving asymptotically optimal editing performance while meeting the constraints of long-term knowledge preservation. Experimental results show that our framework scales sequential editing capacity to over 10,000 edits while stabilizing general capabilities and boosting average editing efficacy by 11.89% over SOTA baselines. Furthermore, it can be leveraged to enhance the performance of baseline methods. Our code is released at https://github.com/caskcsg/LyapLock.
中文: 大语言模型常包含错误知识,为此提出的LyapLock框架结合排队论和李雅普诺夫优化,将顺序编辑分解为可处理的子问题,在确保长期知识保留的同时,将顺序编辑能力扩展至超过10,000次,并将编辑效果提升11.89%。
English: Large Language Models often contain incorrect knowledge, prompting the development of LyapLock, a novel model editing framework that uses queuing theory and Lyapunov optimization to efficiently handle over 10,000 sequential edits while ensuring long-term knowledge preservation and boosting editing efficacy by 11.89%.
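
For readers unfamiliar with the machinery, the standard Lyapunov-optimization pattern the abstract invokes looks as follows; this is illustrative, and LyapLock's exact formulation may differ.

```latex
% A virtual queue Q(t) tracks the cumulative preservation error e(\theta_t)
% against a per-step budget b:
Q(t+1) = \max\bigl( Q(t) + e(\theta_t) - b,\ 0 \bigr)
% Each edit step t then solves a tractable subproblem in which V > 0 trades
% editing loss against queue-weighted preservation error:
\min_{\theta_t} \ V \cdot \ell_{\mathrm{edit}}(\theta_t) + Q(t)\, e(\theta_t)
```

A growing queue makes later edits prioritize preservation, which is how the long-term constraint gets enforced stepwise.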

Authors:Pingqing Zheng, Jiayin Qin, Fuqi Zhang, Shang Wu, Yu Cao, Caiwen Ding, Yang Zhao
Title: HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
Abstract:
Large Language Models (LLMs) have demonstrated their potential in hardware design tasks, such as Hardware Description Language (HDL) generation and debugging. Yet, their performance in real-world, repository-level HDL projects with thousands or even tens of thousands of lines of code is hindered. To this end, we propose HDLxGraph, a novel framework that integrates Graph Retrieval Augmented Generation (Graph RAG) with LLMs, introducing HDL-specific graph representations by incorporating Abstract Syntax Trees (ASTs) and Data Flow Graphs (DFGs) to capture both the code graph view and the hardware graph view. HDLxGraph utilizes a dual-retrieval mechanism that not only mitigates the limited recall issues inherent in similarity-based semantic retrieval by incorporating structural information, but also enhances its extensibility to various real-world tasks by task-specific retrieval finetuning. Additionally, to address the lack of comprehensive HDL search benchmarks, we introduce HDLSearch, a multi-granularity evaluation dataset derived from real-world repository-level projects. Experimental results demonstrate that HDLxGraph significantly improves average search accuracy, debugging efficiency and completion quality by 12.04%, 12.22% and 5.04% compared to similarity-based RAG, respectively. The code of HDLxGraph and the collected HDLSearch benchmark are available at https://github.com/Nick-Zheng-Q/HDLxGraph.
中文摘要:HDLxGraph是一个创新框架,将图检索增强生成与大语言模型相结合,通过硬件专用图表示显著提升了现实硬件设计任务中的代码搜索、调试和完成性能。
English Summary: HDLxGraph is a novel framework that integrates Graph Retrieval Augmented Generation with Large Language Models, using hardware-specific graph representations to significantly enhance performance in real-world hardware design tasks such as code search, debugging, and completion.

Authors:Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Hongning Wang, Minlie Huang
Title: Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
Abstract:
Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk accompanying this practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, requiring only black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% of the downstream fine-tuning data (queries) out of a total of 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with an improved attack. Overall, we highlight the urgency of this newly identified data-breach risk in fine-tuning, and we hope that follow-up research can push progress on addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.
Chinese: 使用专有数据对开源大语言模型进行微调存在严重风险,模型创建者可通过简单的后门训练提取下游私有数据,在理想条件下提取率高达94.9%。
English: Fine-tuning open-source large language models with proprietary data poses a significant risk, as creators can extract private downstream data through simple backdoor training, achieving extraction rates as high as 94.9% in ideal settings.

Authors:Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He
Title: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
Abstract:
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.
中文: 本文提出LASER及其增强版LASER-D方法,通过基于长度的奖励塑造机制减少大推理模型的输出冗余,在显著压缩推理过程的同时实现了更优的性能与效率平衡。
English: This paper introduces LASER and its enhanced version LASER-D, RL-based methods that use length-based reward shaping to reduce redundancy in Large Reasoning Models, achieving superior efficiency and performance with significantly shorter reasoning traces.
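
A minimal sketch of the step-function reward; the target length is a placeholder, the zero reward for incorrect answers is an assumption, and LASER-D would further adapt the target to query difficulty.

```python
def laser_style_reward(is_correct: bool, n_tokens: int, target_len: int = 4096) -> float:
    """Step function: full reward only for correct answers within the length budget."""
    return 1.0 if (is_correct and n_tokens <= target_len) else 0.0

print(laser_style_reward(True, 2048))  # 1.0: correct and within budget
print(laser_style_reward(True, 9000))  # 0.0: correct but too long
```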

Authors:David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, Mrinmaya Sachan
Title: From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning
Abstract:
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy which requires strategically withholding answers. To mitigate this, we propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors using simulated student-tutor interactions by emphasizing pedagogical quality and guided problem-solving over simply giving away answers. We use our method to train a 7B parameter tutor model without human annotations which reaches similar performance to larger proprietary models like LearnLM. We introduce a controllable reward weighting to balance pedagogical support and student solving accuracy, allowing us to trace the Pareto frontier between these two objectives. Our models better preserve reasoning capabilities than single-turn SFT baselines and can optionally enhance interpretability through thinking tags that expose the model's instructional planning.
中文摘要:本研究提出一种基于强化学习的对齐框架,通过强调引导式解题而非直接给答案,将大语言模型快速适配为高效辅导教师,在无需人工标注的情况下使70亿参数模型达到与专业模型相当的教学效果,同时保持推理能力并可通过思维标签增强教学策略的可解释性。
English Summary: This study introduces a reinforcement learning framework that aligns large language models with effective tutoring principles by prioritizing guided problem-solving over direct answers, achieving comparable performance to proprietary models while preserving reasoning capabilities and offering interpretability through instructional planning tags.

Authors:Haocheng Ju, Bin Dong
Title: MIRB: Mathematical Information Retrieval Benchmark
Abstract:
Mathematical Information Retrieval (MIR) is the task of retrieving information from mathematical documents and plays a key role in various applications, including theorem search in mathematical libraries, answer retrieval on math forums, and premise selection in automated theorem proving. However, a unified benchmark for evaluating these diverse retrieval tasks has been lacking. In this paper, we introduce MIRB (Mathematical Information Retrieval Benchmark) to assess the MIR capabilities of retrieval models. MIRB includes four tasks: semantic statement retrieval, question-answer retrieval, premise retrieval, and formula retrieval, spanning a total of 12 datasets. We evaluate 13 retrieval models on this benchmark and analyze the challenges inherent to MIR. We hope that MIRB provides a comprehensive framework for evaluating MIR systems and helps advance the development of more effective retrieval models tailored to the mathematical domain.
中文: 本文提出MIRB基准,通过四项任务和12个数据集统一评估数学信息检索系统,填补了标准化测评空白,并对13种模型进行分析以推动数学领域检索技术的发展。
English: This paper introduces MIRB, a unified benchmark for evaluating Mathematical Information Retrieval across four tasks and 12 datasets, addressing the lack of standardized assessment and analyzing 13 models to advance domain-specific retrieval systems.

Authors:Yiyun Zhou, Chang Yao, Jingyuan Chen
Title: CoLA: Collaborative Low-Rank Adaptation
Abstract:
The scaling law of Large Language Models (LLMs) reveals a power-law relationship, showing diminishing return on performance as model scale increases. While training LLMs from scratch is resource-intensive, fine-tuning a pre-trained model for specific tasks has become a practical alternative. Full fine-tuning (FFT) achieves strong performance; however, it is computationally expensive and inefficient. Parameter-efficient fine-tuning (PEFT) methods, like LoRA, have been proposed to address these challenges by freezing the pre-trained model and adding lightweight task-specific modules. LoRA, in particular, has proven effective, but its application to multi-task scenarios is limited by interference between tasks. Recent approaches, such as Mixture-of-Experts (MOE) and asymmetric LoRA, have aimed to mitigate these issues but still struggle with sample scarcity and noise interference due to their fixed structure. In response, we propose CoLA, a more flexible LoRA architecture with an efficient initialization scheme, and introduce three collaborative strategies to enhance performance by better utilizing the quantitative relationships between matrices A and B. Our experiments demonstrate the effectiveness and robustness of CoLA, outperforming existing PEFT methods, especially in low-sample scenarios. Our data and code are fully publicly available at https://github.com/zyy-2001/CoLA.
中文摘要:提出的CoLA架构通过优化初始化和协作策略增强了LoRA的灵活性与效率,在低样本场景下相比现有参数高效微调方法展现出更优性能。
English Summary: The proposed CoLA architecture enhances LoRA's flexibility and efficiency through optimized initialization and collaborative strategies, demonstrating superior performance in low-sample scenarios compared to existing parameter-efficient fine-tuning methods.
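
A minimal sketch of one collaborative arrangement consistent with the abstract, several A matrices feeding a shared B; CoLA's actual strategies and initialization scheme differ in detail, so treat this only as an illustration of exploiting the A/B asymmetry.

```python
import torch
import torch.nn as nn

class MultiALoRA(nn.Module):
    """Low-rank adapter with several A projections collaborating through one B."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8, n_a: int = 3):
        super().__init__()
        self.A = nn.ModuleList(nn.Linear(d_in, rank, bias=False) for _ in range(n_a))
        self.B = nn.Linear(rank, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # standard LoRA choice: updates start at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.B(sum(a(x) for a in self.A) / len(self.A))
```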

Authors:Zhexin Zhang, Xian Qi Loye, Victor Shea-Jay Huang, Junxiao Yang, Qi Zhu, Shiyao Cui, Fei Mi, Lifeng Shang, Yingkang Wang, Hongning Wang, Minlie Huang
Title: How Should We Enhance the Safety of Large Reasoning Models: An Empirical Study
Abstract:
Large Reasoning Models (LRMs) have achieved remarkable success on reasoning-intensive tasks such as mathematics and programming. However, their enhanced reasoning capabilities do not necessarily translate to improved safety performance-and in some cases, may even degrade it. This raises an important research question: how can we enhance the safety of LRMs? In this paper, we present a comprehensive empirical study on how to enhance the safety of LRMs through Supervised Fine-Tuning (SFT). Our investigation begins with an unexpected observation: directly distilling safe responses from DeepSeek-R1 fails to significantly enhance safety. We analyze this phenomenon and identify three key failure patterns that contribute to it. We then demonstrate that explicitly addressing these issues during the data distillation process can lead to substantial safety improvements. Next, we explore whether a long and complex reasoning process is necessary for achieving safety. Interestingly, we find that simply using short or template-based reasoning process can attain comparable safety performance-and are significantly easier for models to learn than more intricate reasoning chains. These findings prompt a deeper reflection on the role of reasoning in ensuring safety. Finally, we find that mixing math reasoning data during safety fine-tuning is helpful to balance safety and over-refusal. Overall, we hope our empirical study could provide a more holistic picture on enhancing the safety of LRMs. The code and data used in our experiments are released in https://github.com/thu-coai/LRM-Safety-Study.
中文: 本研究通过监督微调发现,解决关键失效模式并使用简化推理过程可显著提升大型推理模型的安全性,同时混合数学推理数据有助于在安全性和过度拒绝之间取得平衡。
English: This study reveals that supervised fine-tuning can significantly improve the safety of Large Reasoning Models by addressing key failure patterns and using simplified reasoning processes, while balancing safety with reasoning capabilities through mixed training data.

Authors:DongGeon Lee, Joonwon Jang, Jihae Jeong, Hwanjo Yu
Title: Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
Abstract:
Rapid deployment of vision-language models (VLMs) magnifies safety risks, yet most evaluations rely on artificial images. This study asks: How safe are current VLMs when confronted with meme images that ordinary users share? To investigate this question, we introduce MemeSafetyBench, a 50,430-instance benchmark pairing real meme images with both harmful and benign instructions. Using a comprehensive safety taxonomy and LLM-based instruction generation, we assess multiple VLMs across single and multi-turn interactions. We investigate how real-world memes influence harmful outputs, the mitigating effects of conversational context, and the relationship between model scale and safety metrics. Our findings demonstrate that VLMs are more vulnerable to meme-based harmful prompts than to synthetic or typographic images. Memes significantly increase harmful responses and decrease refusals compared to text-only inputs. Though multi-turn interactions provide partial mitigation, elevated vulnerability persists. These results highlight the need for ecologically valid evaluations and stronger safety mechanisms. MemeSafetyBench is publicly available at https://github.com/oneonlee/Meme-Safety-Bench.
Chinese: 研究表明,视觉语言模型在面对真实表情包图像的有害指令时比面对人工图像时更易受影响,凸显了进行生态效度安全评估和加强防护机制的迫切需求。
English: This study reveals that vision-language models are significantly more vulnerable to harmful prompts from real meme images than artificial ones, highlighting the urgent need for ecologically valid safety evaluations and enhanced protective measures.

Authors:Yanzhi Tian, Zeming Liu, Zhengyang Liu, Yuhang Guo
Title: Exploring In-Image Machine Translation with Real-World Background
Abstract:
In-Image Machine Translation (IIMT) aims to translate texts within images from one language to another. Previous research on IIMT was primarily conducted in simplified scenarios, such as images of one-line black-font text on white backgrounds, which are far from reality and impractical for real-world applications. To make IIMT research practically valuable, it is essential to consider a complex scenario where the text backgrounds are derived from real-world images. To facilitate research on complex-scenario IIMT, we design an IIMT dataset that includes subtitle text with real-world backgrounds. However, previous IIMT models perform inadequately in complex scenarios. To address the issue, we propose the DebackX model, which separates the background and text-image from the source image, performs translation on the text-image directly, and fuses the translated text-image with the background to generate the target image. Experimental results show that our model achieves improvements in both translation quality and visual effect.
Chinese: DebackX模型通过从复杂现实背景中分离并直接翻译文本再融合,提升了图像内机器翻译的翻译质量和视觉效果。
English: The DebackX model enhances In-Image Machine Translation by separating and translating text from complex real-world backgrounds before fusing it back, achieving superior translation quality and visual results.
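The abstract describes a three-stage decompose-translate-recompose flow; the sketch below restates it with placeholder callables standing in for the paper's learned components, so the stage boundaries are explicit. All names are illustrative assumptions.

```python
# High-level sketch of the separate -> translate -> fuse flow from the abstract.
def iimt_pipeline(source_image, separate, translate, fuse):
    background, text_image = separate(source_image)  # decompose the input
    translated = translate(text_image)               # translate the text-image directly
    return fuse(background, translated)              # recompose the target image

# Toy demo using strings in place of images
out = iimt_pipeline(
    "bg+[Hello]",
    separate=lambda img: ("bg", "[Hello]"),
    translate=lambda t: "[Bonjour]",
    fuse=lambda bg, t: f"{bg}+{t}",
)
print(out)  # bg+[Bonjour]
```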

Authors:Zihao Jiang, Ben Liu, Miao Peng, Wenjie Xu, Yao Xiao, Zhenyan Shan, Min Peng
Title: Towards Explainable Temporal Reasoning in Large Language Models: A Structure-Aware Generative Framework
Abstract:
While large language models (LLMs) show great potential in temporal reasoning, most existing work focuses heavily on enhancing performance, often neglecting the explainable reasoning processes underlying the results. To address this gap, we introduce a comprehensive benchmark covering a wide range of temporal granularities, designed to systematically evaluate LLMs' capabilities in explainable temporal reasoning. Furthermore, our findings reveal that LLMs struggle to deliver convincing explanations when relying solely on textual information. To address this challenge, we propose GETER, a novel structure-aware generative framework that integrates Graph structures with text for Explainable TEmporal Reasoning. Specifically, we first leverage temporal knowledge graphs to develop a temporal encoder that captures structural information for the query. Subsequently, we introduce a structure-text prefix adapter to map graph structure features into the text embedding space. Finally, LLMs generate explanation text by seamlessly integrating the soft graph token with instruction-tuning prompt tokens. Experimental results indicate that GETER achieves state-of-the-art performance and demonstrates strong generalization capabilities. Our dataset and code are available at https://github.com/carryTatum/GETER.
中文摘要:本文提出了GETER框架,通过将图结构与文本相结合来增强大语言模型的可解释时序推理能力,解决了其仅依赖文本时解释力不足的问题,并实现了最先进的性能表现。
English Summary: This paper introduces GETER, a novel framework that integrates graph structures with text to enhance explainable temporal reasoning in large language models, addressing their limitations in generating convincing explanations and achieving state-of-the-art performance.
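The structure-text prefix adapter is the bridging component here; below is a hedged PyTorch sketch of one plausible form, projecting a graph feature vector into a few soft tokens prepended to the prompt embeddings. Dimensions, layer choices, and the token count are assumptions, not GETER's actual architecture.

```python
# Illustrative prefix adapter: graph features -> soft tokens in the LLM embedding space.
import torch
import torch.nn as nn

class PrefixAdapter(nn.Module):
    def __init__(self, graph_dim, llm_dim, num_prefix_tokens=4):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(graph_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, num_prefix_tokens * llm_dim),
        )
        self.num_prefix_tokens = num_prefix_tokens
        self.llm_dim = llm_dim

    def forward(self, graph_feat, token_embeds):
        # graph_feat: (batch, graph_dim); token_embeds: (batch, seq, llm_dim)
        prefix = self.proj(graph_feat).view(-1, self.num_prefix_tokens, self.llm_dim)
        return torch.cat([prefix, token_embeds], dim=1)  # soft graph tokens lead the prompt

adapter = PrefixAdapter(graph_dim=128, llm_dim=512)
fused = adapter(torch.randn(2, 128), torch.randn(2, 10, 512))
print(fused.shape)  # torch.Size([2, 14, 512]): 4 soft tokens prepended
```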

Authors:Jie Ma, Ning Qu, Zhitao Gao, Rui Xing, Jun Liu, Hongbin Pei, Jiang Xie, Linyun Song, Pinghui Wang, Jing Tao, Zhou Su
Title: Deliberation on Priors: Trustworthy Reasoning of Large Language Models on Knowledge Graphs
Abstract:
Knowledge graph-based retrieval-augmented generation seeks to mitigate hallucinations in Large Language Models (LLMs) caused by insufficient or outdated knowledge. However, existing methods often fail to fully exploit the prior knowledge embedded in knowledge graphs (KGs), particularly their structural information and explicit or implicit constraints. The former can enhance the faithfulness of LLMs' reasoning, while the latter can improve the reliability of response generation. Motivated by these, we propose a trustworthy reasoning framework, termed Deliberation over Priors (DP), which sufficiently utilizes the priors contained in KGs. Specifically, DP adopts a progressive knowledge distillation strategy that integrates structural priors into LLMs through a combination of supervised fine-tuning and Kahneman-Tversky optimization, thereby improving the faithfulness of relation path generation. Furthermore, our framework employs a reasoning-introspection strategy, which guides LLMs to perform refined reasoning verification based on extracted constraint priors, ensuring the reliability of response generation. Extensive experiments on three benchmark datasets demonstrate that DP achieves new state-of-the-art performance, especially a Hit@1 improvement of 13% on the ComplexWebQuestions dataset, and generates highly trustworthy responses. We also conduct various analyses to verify its flexibility and practicality. The code is available at https://github.com/reml-group/Deliberation-on-Priors.
Chinese: 提出的"先验深思"框架通过渐进式知识蒸馏和推理自省策略,充分整合知识图谱中的结构先验与约束先验,在ComplexWebQuestions数据集上实现13%的Hit@1提升,显著增强了大型语言模型生成结果的可信度。
English: The proposed Deliberation over Priors framework enhances LLM trustworthiness by integrating structural and constraint knowledge from knowledge graphs through progressive distillation and reasoning-introspection, achieving state-of-the-art performance with a 13% Hit@1 improvement on ComplexWebQuestions.

Authors:Wonje Jeung, Sangyeon Yoon, Hyesoo Hong, Soeun Kim, Seungju Han, Youngjae Yu, Albert No
Title: DUSK: Do Not Unlearn Shared Knowledge
Abstract:
Large language models (LLMs) are increasingly deployed in real-world applications, raising concerns about the unauthorized use of copyrighted or sensitive data. Machine unlearning aims to remove such 'forget' data while preserving utility and information from the 'retain' set. However, existing evaluations typically assume that forget and retain sets are fully disjoint, overlooking realistic scenarios where they share overlapping content. For instance, a news article may need to be unlearned, even though the same event, such as an earthquake in Japan, is also described factually on Wikipedia. Effective unlearning should remove the specific phrasing of the news article while preserving publicly supported facts. In this paper, we introduce DUSK, a benchmark designed to evaluate unlearning methods under realistic data overlap. DUSK constructs document sets that describe the same factual content in different styles, with some shared information appearing across all sets and other content remaining unique to each. When one set is designated for unlearning, an ideal method should remove its unique content while preserving shared facts. We define seven evaluation metrics to assess whether unlearning methods can achieve this selective removal. Our evaluation of nine recent unlearning methods reveals a key limitation: while most can remove surface-level text, they often fail to erase deeper, context-specific knowledge without damaging shared content. We release DUSK as a public benchmark to support the development of more precise and reliable unlearning techniques for real-world applications.

Authors:Bowen Jin, Jinsung Yoon, Priyanka Kargupta, Sercan O. Arik, Jiawei Han
Title: An Empirical Study on Reinforcement Learning for Reasoning-Search Interleaved LLM Agents
Abstract:
Reinforcement learning (RL) has demonstrated strong potential in training large language models (LLMs) capable of complex reasoning for real-world problem solving. More recently, RL has been leveraged to create sophisticated LLM-based search agents that adeptly combine reasoning with search engine use. While the use of RL for training search agents is promising, the optimal design of such agents remains not fully understood. In particular, key factors, such as (1) reward formulation, (2) the choice and characteristics of the underlying LLM, and (3) the role of the search engine in the RL process, require further investigation. In this work, we conduct comprehensive empirical studies to systematically investigate these factors and offer actionable insights. We highlight several key findings: format rewards are effective in improving final performance, whereas intermediate retrieval rewards have limited impact; the scale and initialization of the LLM (general-purpose vs. reasoning-specialized) significantly influence RL outcomes; and the choice of search engine plays a critical role in shaping RL training dynamics and the robustness of the trained agent during inference. These findings establish important guidelines for successfully building and deploying LLM-based search agents in real-world applications. Code is available at https://github.com/PeterGriffinJin/Search-R1.
中文摘要:强化学习能有效训练大语言模型开发结合推理与搜索引擎的智能搜索代理,其中奖励设计、模型选择和搜索引擎等关键因素显著影响性能与鲁棒性,为实际应用提供了重要指导。
English Summary: Reinforcement learning effectively trains large language models to create search agents that integrate reasoning with search engines, with key factors like reward design, model choice, and search engine selection critically impacting performance and robustness.

Authors:Ziliang Wang, Xuhui Zheng, Kang An, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu
Title: StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization
Abstract:
Efficient multi-hop reasoning requires Large Language Model (LLM)-based agents to iteratively acquire high-value external knowledge. Previous work has explored reinforcement learning (RL) to train LLMs to perform search-based document retrieval, achieving notable improvements in QA performance, but it underperforms on complex, multi-hop QA because rewards are sparse and derived from the global signal only. To address this gap in existing research, we introduce StepSearch, a framework for search LLMs trained with a step-wise proximal policy optimization method. It consists of richer and more detailed intermediate search rewards and token-level process supervision based on information gain and redundancy penalties, to better guide each search step. We construct a fine-grained question-answering dataset containing sub-question-level search trajectories, derived from open-source datasets through a dedicated data pipeline. On standard multi-hop QA benchmarks, StepSearch significantly outperforms global-reward baselines, achieving 11.2% and 4.2% absolute improvements for 3B and 7B models over various search-with-RL baselines using only 19k training samples, demonstrating the effectiveness of fine-grained, step-wise supervision in optimizing deep search LLMs. Our code will be released at https://github.com/Zillwang/StepSearch.
中文: StepSearch框架通过引入基于信息增益和冗余惩罚的逐步奖励机制与过程监督,有效优化了大型语言模型在复杂多跳问答中的搜索能力,仅用少量训练数据即在标准测试中显著超越现有强化学习方法。
English: StepSearch is a novel framework that enhances multi-hop reasoning in LLMs by employing step-wise proximal policy optimization with detailed intermediate rewards and token-level supervision, significantly outperforming existing methods on complex QA benchmarks with minimal training data.
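A toy rendering of the intermediate reward idea: reward novel gold evidence (information gain) and penalize re-fetching already-seen passages (redundancy). The weights and the set-based formulation are assumptions for illustration; the paper's exact reward is richer.

```python
# Illustrative per-step search reward: information gain minus redundancy penalty.
def step_reward(retrieved, gold_keys, seen, gain_w=1.0, redund_w=0.5):
    """retrieved: passage ids fetched this step; gold_keys: ids supporting the
    sub-question; seen: ids retrieved in earlier steps."""
    new_hits = (retrieved & gold_keys) - seen
    redundant = retrieved & seen
    gain = len(new_hits) / max(len(gold_keys), 1)      # reward novel evidence
    penalty = len(redundant) / max(len(retrieved), 1)  # penalize re-fetching
    return gain_w * gain - redund_w * penalty

seen = set()
for step_docs, gold in [({1, 2}, {2, 5}), ({2, 5}, {2, 5})]:
    r = step_reward(step_docs, gold, seen)
    seen |= step_docs
    print(round(r, 3))  # 0.5 then 0.25: the second step repeats doc 2
```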

Authors:Zhiyu Shen, Jiyuan Liu, Yunhe Pang, Yanghui Rao
Title: HopWeaver: Synthesizing Authentic Multi-Hop Questions Across Text Corpora
Abstract:
Multi-Hop Question Answering (MHQA) is crucial for evaluating the model's capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first automatic framework synthesizing authentic multi-hop questions from unstructured text corpora without human intervention. HopWeaver synthesizes two types of multi-hop questions (bridge and comparison) using an innovative approach that identifies complementary documents across corpora. Its coherent pipeline constructs authentic reasoning paths that integrate information across multiple documents, ensuring synthesized questions necessitate authentic multi-hop reasoning. We further present a comprehensive system for evaluating synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our approach is valuable for developing MHQA datasets in specialized domains with scarce annotated resources. The code for HopWeaver is publicly available.
Chinese: HopWeaver是一种创新的跨文档框架,无需人工干预即可自动生成真实的多跳问题,以更低成本产出与人工标注数据集质量相当的高质量基准。
English: HopWeaver is an innovative cross-document framework that automatically generates authentic multi-hop questions without human intervention, producing high-quality benchmarks comparable to human-annotated datasets at lower cost.

Authors:Yuhang Zhou, Jing Zhu, Shengyi Qian, Zhuokai Zhao, Xiyao Wang, Xiaoyu Liu, Ming Li, Paiheng Xu, Wei Ai, Furong Huang
Title: DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data
Abstract:
Large Language Models (LLMs) are increasingly aligned with human preferences through Reinforcement Learning from Human Feedback (RLHF). Among RLHF methods, Group Relative Policy Optimization (GRPO) has gained attention for its simplicity and strong performance, notably eliminating the need for a learned value function. However, GRPO implicitly assumes a balanced domain distribution and uniform semantic alignment across groups, assumptions that rarely hold in real-world datasets. When applied to multi-domain, imbalanced data, GRPO disproportionately optimizes for dominant domains, neglecting underrepresented ones and resulting in poor generalization and fairness. We propose Domain-Informed Self-Consistency Policy Optimization (DISCO), a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value. Together, these strategies promote more equitable and effective policy learning across domains. Extensive experiments across multiple LLMs and skewed training distributions show that DISCO improves generalization, outperforms existing GRPO variants by 5% on Qwen3 models, and sets new state-of-the-art results on multi-domain alignment benchmarks. Our code and data are available at https://github.com/Tonyzhou98/disco_grpo.
中文: 提出的DISCO方法通过引入领域感知和难度感知的奖励缩放机制,有效解决GRPO在多领域数据中的群体不平衡问题,显著提升模型的泛化能力与公平性,并在多领域对齐基准测试中创下最新最优表现。
English: The proposed DISCO method enhances GRPO by incorporating domain-aware and difficulty-aware reward scaling to address inter-group imbalance, improving generalization and fairness across multi-domain datasets while achieving state-of-the-art performance.
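The two scaling ideas can be sketched independently: inverse-frequency weights for domains, and a self-consistency proxy for prompt difficulty. The exact scaling functions below are assumptions in the spirit of the abstract, not DISCO's published formulas.

```python
# Hedged sketch of domain-aware and difficulty-aware reward scaling.
from collections import Counter

def domain_weights(domains):
    counts = Counter(domains)
    total = len(domains)
    # Inverse-frequency weighting; per-sample weights average to 1 by construction
    return {d: total / (len(counts) * c) for d, c in counts.items()}

def difficulty_weight(sampled_answers):
    # Low self-consistency (no dominant answer) suggests an uncertain, more
    # informative prompt, so it receives a larger weight.
    top = Counter(sampled_answers).most_common(1)[0][1]
    consistency = top / len(sampled_answers)
    return 2.0 - consistency  # in [1, 2): uncertain prompts get more weight

print(domain_weights(["math", "math", "math", "law"]))  # rare "law" upweighted
print(difficulty_weight(["A", "B", "A", "C"]))          # 1.5
```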

Authors:Chen Huang, Junkai Luo, Xinzuo Wang, Wenqiang Lei, Jiancheng Lv
Title: Can Large Language Models Understand Internet Buzzwords Through User-Generated Content
Abstract:
The massive user-generated content (UGC) available in Chinese social media is giving rise to the possibility of studying internet buzzwords. In this paper, we study if large language models (LLMs) can generate accurate definitions for these buzzwords based on UGC as examples. Our work serves a threefold contribution. First, we introduce CHEER, the first dataset of Chinese internet buzzwords, each annotated with a definition and relevant UGC. Second, we propose a novel method, called RESS, to effectively steer the comprehending process of LLMs to produce more accurate buzzword definitions, mirroring the skills of human language learning. Third, with CHEER, we benchmark the strengths and weaknesses of various off-the-shelf definition generation methods and our RESS. Our benchmark demonstrates the effectiveness of RESS while revealing crucial shared challenges: over-reliance on prior exposure, underdeveloped inferential abilities, and difficulty identifying high-quality UGC to facilitate comprehension. We believe our work lays the groundwork for future advancements in LLM-based definition generation. Our dataset and code are available at https://github.com/SCUNLP/Buzzword.
中文: 本文提出了首个中文网络流行语数据集CHEER和RESS方法,通过引导大语言模型理解用户生成内容来提升流行语定义生成准确性,同时评估了现有方法的优劣并揭示了关键挑战。
English: This paper introduces CHEER, the first Chinese internet buzzword dataset, and proposes RESS, a method to improve large language models' accuracy in generating buzzword definitions from user-generated content, while benchmarking existing approaches and identifying key challenges.

Authors:Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, Numaan Naeem, Muhammad Ahsan Riaz Khan, Arham Riaz, Muhammad Arslan Manzoor, Yuxia Wang, Preslav Nakov
Title: UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking
Abstract:
The rapid adoption of large language models (LLMs) has raised critical concerns regarding the factual reliability of their outputs, especially in low-resource languages such as Urdu. Existing automated fact-checking solutions overwhelmingly focus on English, leaving a significant gap for the 200+ million Urdu speakers worldwide. In this work, we introduce UrduFactCheck, the first comprehensive, modular fact-checking framework specifically tailored for Urdu. Our system features a dynamic, multi-strategy evidence retrieval pipeline that combines monolingual and translation-based approaches to address the scarcity of high-quality Urdu evidence. We curate and release two new hand-annotated benchmarks: UrduFactBench for claim verification and UrduFactQA for evaluating LLM factuality. Extensive experiments demonstrate that UrduFactCheck, particularly its translation-augmented variants, consistently outperforms baselines and open-source alternatives on multiple metrics. We further benchmark twelve state-of-the-art (SOTA) LLMs on factual question answering in Urdu, highlighting persistent gaps between proprietary and open-source models. UrduFactCheck's code and datasets are open-sourced and publicly available at https://github.com/mbzuai-nlp/UrduFactCheck.
Chinese: 本研究推出了首个针对乌尔都语的全面事实核查框架UrduFactCheck,通过采用多策略证据检索系统解决可靠信息稀缺问题,在新开发的基准测试中优于现有方法,同时评估了多种大语言模型在乌尔都语中的事实准确性。
English: This study introduces UrduFactCheck, the first comprehensive fact-checking framework for Urdu that addresses the scarcity of reliable information by employing a multi-strategy evidence retrieval system and outperforms existing methods on newly developed benchmarks, while also evaluating the factual accuracy of various LLMs in Urdu.

Authors:Kaiwen Zha, Zhengqi Gao, Maohao Shen, Zhang-Wei Hong, Duane S. Boning, Dina Katabi
Title: RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning
Abstract:
Reinforcement learning (RL) has recently emerged as a compelling approach for enhancing the reasoning capabilities of large language models (LLMs), where an LLM generator serves as a policy guided by a verifier (reward model). However, current RL post-training methods for LLMs typically use verifiers that are fixed (rule-based or frozen pretrained) or trained discriminatively via supervised fine-tuning (SFT). Such designs are susceptible to reward hacking and generalize poorly beyond their training distributions. To overcome these limitations, we propose Tango, a novel framework that uses RL to concurrently train both an LLM generator and a verifier in an interleaved manner. A central innovation of Tango is its generative, process-level LLM verifier, which is trained via RL and co-evolves with the generator. Importantly, the verifier is trained solely based on outcome-level verification correctness rewards without requiring explicit process-level annotations. This generative RL-trained verifier exhibits improved robustness and superior generalization compared to deterministic or SFT-trained verifiers, fostering effective mutual reinforcement with the generator. Extensive experiments demonstrate that both components of Tango achieve state-of-the-art results among 7B/8B-scale models: the generator attains best-in-class performance across five competition-level math benchmarks and four challenging out-of-domain reasoning tasks, while the verifier leads on the ProcessBench dataset. Remarkably, both components exhibit particularly substantial improvements on the most difficult mathematical reasoning problems. Code is at: https://github.com/kaiwenzha/rl-tango.
中文摘要:Tango提出了一种新颖的强化学习框架,通过协同训练LLM生成器和生成式验证器,在无需过程级标注的情况下实现两者相互增强,并在复杂推理任务上取得了最先进的性能。
English Summary: Tango introduces a novel reinforcement learning framework that co-trains an LLM generator and a generative verifier, achieving state-of-the-art performance on complex reasoning tasks through their mutual reinforcement without requiring process-level annotations.

Authors:Shan Chen, Pedro Moreira, Yuxin Xiao, Sam Schmidgall, Jeremy Warner, Hugo Aerts, Thomas Hartvigsen, Jack Gallifant, Danielle S. Bitterman
Title: MedBrowseComp: Benchmarking Medical Deep Research and Computer Use
Abstract:
Large language models (LLMs) are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands integrating heterogeneous knowledge bases -- trials, primary studies, regulatory documents, and cost data -- under strict accuracy constraints. Existing evaluations often rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended generation, leaving their real-world utility unclear. To close this gap, we present MedBrowseComp, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from live, domain-specific knowledge bases. MedBrowseComp contains more than 1,000 human-curated questions that mirror clinical scenarios where practitioners must reconcile fragmented or conflicting information to reach an up-to-date conclusion. Applying MedBrowseComp to frontier agentic systems reveals performance shortfalls as low as ten percent, exposing a critical gap between current LLM capabilities and the rigor demanded in clinical settings. MedBrowseComp therefore offers a clear testbed for reliable medical information seeking and sets concrete goals for future model and toolchain upgrades. You can visit our project page at: https://moreirap12.github.io/mbc-browse-app/

Authors:Zhiwei Liu, Paul Thompson, Jiaqi Rong, Sophia Ananiadou
Title: ConspEmoLLM-v2: A robust and stable model to detect sentiment-transformed conspiracy theories
Abstract:
Despite the many benefits of large language models (LLMs), they can also cause harm, e.g., through the automatic generation of misinformation, including conspiracy theories. Moreover, LLMs can also "disguise" conspiracy theories by altering characteristic textual features, e.g., by transforming their typically strong negative emotions into a more positive tone. Although several studies have proposed automated conspiracy theory detection methods, they are usually trained using human-authored text, whose features can vary from LLM-generated text. Furthermore, several conspiracy detection models, including the previously proposed ConspEmoLLM, rely heavily on the typical emotional features of human-authored conspiracy content. As such, intentionally disguised content may evade detection. To combat these issues, we first developed an augmented version of the ConDID conspiracy detection dataset, ConDID-v2, which supplements human-authored conspiracy tweets with versions rewritten by an LLM to reduce the negativity of their original sentiment. The quality of the rewritten tweets was verified by combining human and LLM-based assessment. We subsequently used ConDID-v2 to train ConspEmoLLM-v2, an enhanced version of ConspEmoLLM. Experimental results demonstrate that ConspEmoLLM-v2 retains or exceeds the performance of ConspEmoLLM on the original human-authored content in ConDID, and considerably outperforms both ConspEmoLLM and several other baselines when applied to sentiment-transformed tweets in ConDID-v2. The project will be available at https://github.com/lzw108/ConspEmoLLM.
Chinese Summary: 本研究通过构建增强数据集ConDID-v2和改进检测模型ConspEmoLLM-v2,有效解决了大语言模型生成的情绪伪装型阴谋论检测难题,显著提升了针对情感修饰内容的识别性能。
English Summary: This study addresses the challenge of detecting LLM-generated conspiracy theories that disguise negative emotional cues by introducing an augmented dataset, ConDID-v2, and an enhanced detection model, ConspEmoLLM-v2, which significantly improves detection accuracy on sentiment-manipulated content.

Authors:Yu Zhang, Wenxiang Guo, Changhao Pan, Dongyu Yao, Zhiyuan Zhu, Ziyue Jiang, Yuhan Wang, Tao Jin, Zhou Zhao
Title: TCSinger 2: Customizable Multilingual Zero-shot Singing Voice Synthesis
Abstract:
Customizable multilingual zero-shot singing voice synthesis (SVS) has various potential applications in music composition and short video dubbing. However, existing SVS models overly depend on phoneme and note boundary annotations, limiting their robustness in zero-shot scenarios and producing poor transitions between phonemes and notes. Moreover, they also lack effective multi-level style control via diverse prompts. To overcome these challenges, we introduce TCSinger 2, a multi-task multilingual zero-shot SVS model with style transfer and style control based on various prompts. TCSinger 2 mainly includes three key modules: 1) the Blurred Boundary Content (BBC) Encoder, which predicts duration, extends content embedding, and applies masking to the boundaries to enable smooth transitions; 2) the Custom Audio Encoder, which uses contrastive learning to extract aligned representations from singing, speech, and textual prompts; and 3) the Flow-based Custom Transformer, which leverages Cus-MOE with F0 supervision to enhance both the synthesis quality and style modeling of the generated singing voice. Experimental results show that TCSinger 2 outperforms baseline models in both subjective and objective metrics across multiple related tasks. Singing voice samples are available at https://aaronz345.github.io/TCSinger2Demo/.
中文: TCSinger 2 是一种多语言零样本歌声合成模型,通过三个关键模块解决了边界标注和风格控制的局限性,在平滑过渡和多层次风格建模上表现卓越。
English: TCSinger 2 is a multilingual zero-shot singing voice synthesis model that overcomes limitations in boundary annotations and style control through three innovative modules, achieving superior performance in smooth transitions and multi-level style modeling.

Authors:Xiaoyan Bai, Ike Peng, Aditya Singh, Chenhao Tan
Title: Concept Incongruence: An Exploration of Time and Death in Role Playing
Abstract:
Consider the prompt "Draw a unicorn with two horns". Should large language models (LLMs) recognize that a unicorn has only one horn by definition and ask users for clarification, or proceed to generate something anyway? We introduce concept incongruence to capture such phenomena, where concept boundaries clash with each other, either in user prompts or in model representations, often leading to under-specified or mis-specified behaviors. In this work, we take the first step towards defining and analyzing model behavior under concept incongruence. Focusing on temporal boundaries in the Role-Play setting, we propose three behavioral metrics (abstention rate, conditional accuracy, and answer rate) to quantify model behavior under incongruence due to the role's death. We show that models fail to abstain after death and suffer an accuracy drop compared to the Non-Role-Play setting. Through probing experiments, we identify two main causes: (i) unreliable encoding of the "death" state across different years, leading to unsatisfactory abstention behavior, and (ii) shifts in the model's temporal representations induced by role playing, resulting in accuracy drops. We leverage these insights to improve the consistency of the model's abstention and answer behaviors. Our findings suggest that concept incongruence leads to unexpected model behaviors and point to future directions for improving model behavior under concept incongruence.
中文摘要:本研究提出“概念冲突”来分析语言模型处理概念边界冲突时的表现,发现模型在角色扮演中因死亡状态编码不可靠和时间表征偏移,常无法停止回答且准确性下降。
English Summary: This study introduces "concept incongruence" to examine how language models handle conflicting concept boundaries, revealing that models often fail to abstain from answering when roles die in role-play scenarios due to unreliable temporal encoding and representation shifts.
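The three behavioral metrics are simple to compute from per-item evaluation records; the sketch below shows one plausible reading, with the record fields (`answered`, `correct`, `should_abstain`) as illustrative assumptions rather than the paper's schema.

```python
# Toy computation of abstention rate, conditional accuracy, and answer rate.
def metrics(records):
    # records: dicts with answered (bool), correct (bool), should_abstain (bool),
    # where should_abstain marks queries after the role's death.
    post = [r for r in records if r["should_abstain"]]
    abstention_rate = sum(not r["answered"] for r in post) / max(len(post), 1)
    answered = [r for r in records if r["answered"]]
    conditional_accuracy = sum(r["correct"] for r in answered) / max(len(answered), 1)
    answer_rate = len(answered) / len(records)
    return abstention_rate, conditional_accuracy, answer_rate

recs = [
    {"answered": True, "correct": True, "should_abstain": False},
    {"answered": True, "correct": False, "should_abstain": True},  # failed to abstain
]
print(metrics(recs))  # (0.0, 0.5, 1.0)
```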

Authors:Yihong Liu, Mingyang Wang, Amir Hossein Kargaran, Felicia Körner, Ercong Nie, Barbara Plank, François Yvon, Hinrich Schütze
Title: Tracing Multilingual Factual Knowledge Acquisition in Pretraining
Abstract:
Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. However, most studies evaluate only the final model, leaving the development of factual recall and crosslingual consistency throughout pretraining largely unexplored. In this work, we trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B as a case study. We find that both accuracy and consistency improve over time for most languages. We show that this improvement is primarily driven by the fact frequency in the pretraining corpus: more frequent facts are more likely to be recalled correctly, regardless of language. Yet, some low-frequency facts in non-English languages can still be correctly recalled. Our analysis reveals that these instances largely benefit from crosslingual transfer of their English counterparts, an effect that emerges predominantly in the early stages of pretraining. We pinpoint two distinct pathways through which multilingual factual knowledge acquisition occurs: (1) frequency-driven learning, which is dominant and language-agnostic, and (2) crosslingual transfer, which is limited in scale and typically constrained to relation types involving named entities. We release our code and data to facilitate further research at https://github.com/cisnlp/multilingual-fact-tracing.
中文摘要:本研究追踪了OLMo-7B预训练过程中事实回忆与跨语言一致性的演变,发现提升主要受训练数据中事实频率驱动,而早期阶段的跨语言迁移特别有助于低频非英语事实的正确回忆。
English Summary: This study tracks the evolution of factual recall and crosslingual consistency during OLMo-7B's pretraining, revealing that improvements are primarily driven by fact frequency in training data, with crosslingual transfer from English particularly aiding low-frequency non-English facts in early stages.

Authors:Tingchen Fu, Jiawei Gu, Yafu Li, Xiaoye Qu, Yu Cheng
Title: Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models
Abstract:
Instruction-following is essential for aligning large language models (LLMs) with user intent. While recent reasoning-oriented models exhibit impressive performance on complex mathematical problems, their ability to adhere to natural language instructions remains underexplored. In this work, we introduce MathIF, a dedicated benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability, as models that reason more effectively often struggle to comply with user directives. We find that models tuned on distilled long chains-of-thought or trained with reasoning-oriented reinforcement learning often degrade in instruction adherence, especially when generation length increases. Furthermore, we show that even simple interventions can partially recover obedience, though at the cost of reasoning performance. These findings highlight a fundamental tension in current LLM training paradigms and motivate the need for more instruction-aware reasoning models. We release the code and data at https://github.com/TingchenFu/MathIF.
Chinese: 本研究提出了MathIF基准,揭示大型语言模型在提升推理能力时常削弱其遵循指令的能力,凸显了二者间的矛盾,需开发更注重指令的推理模型。
English: The study introduces MathIF, a benchmark revealing that enhancing reasoning in large language models often compromises their ability to follow instructions, highlighting a trade-off that necessitates more instruction-aware models.

Authors:Tuan-Vinh La, Minh-Hieu Nguyen, Minh-Son Dao
Title: KGAlign: Joint Semantic-Structural Knowledge Encoding for Multimodal Fake News Detection
Abstract:
Fake news detection remains a challenging problem due to the complex interplay between textual misinformation, manipulated images, and external knowledge reasoning. While existing approaches have achieved notable results in verifying veracity and cross-modal consistency, two key challenges persist: (1) existing methods often consider only the global image context while neglecting local object-level details, and (2) they fail to incorporate external knowledge and entity relationships for deeper semantic understanding. To address these challenges, we propose a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations. Our approach leverages bottom-up attention to capture fine-grained object details, CLIP for global image semantics, and RoBERTa for context-aware text encoding. We further enhance knowledge utilization by retrieving and adaptively selecting relevant entities from a knowledge graph. The fused multi-modal features are processed through a Transformer-based classifier to predict news veracity. Experimental results demonstrate that our model outperforms recent approaches, showcasing the effectiveness of the neighbor selection mechanism and multi-modal fusion for fake news detection. Our proposal introduces a new paradigm: knowledge-grounded multimodal reasoning. By integrating explicit entity-level selection and NLI-guided filtering, we shift fake news detection from feature fusion to semantically grounded verification. For reproducibility and further research, the source code is publicly available at https://github.com/latuanvinh1998/KGAlign.
中文: 本文提出了一种新颖的多模态假新闻检测框架,融合视觉、文本和知识表征,通过细粒度对象细节、全局图像语义和外部知识,利用基于Transformer的分类器超越现有方法。
English: This paper introduces a novel multi-modal fake news detection framework that integrates visual, textual, and knowledge-based representations, leveraging fine-grained object details, global image semantics, and external knowledge to outperform existing methods through a Transformer-based classifier.

Authors:Haolei Xu, Yuchen Yan, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Shengpei Jiang, Kaitao Song, Weiming Lu, Jun Xiao, Yueting Zhuang
Title: Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning
Abstract:
Large language models (LLMs) have achieved remarkable progress on mathematical tasks through Chain-of-Thought (CoT) reasoning. However, existing mathematical CoT datasets often suffer from Thought Leaps due to experts omitting intermediate steps, which negatively impacts model learning and generalization. We propose the CoT Thought Leap Bridge Task, which aims to automatically detect leaps and generate the missing intermediate reasoning steps to restore the completeness and coherence of CoT. To facilitate this, we construct a specialized training dataset called ScaleQM+, based on the structured ScaleQuestMath dataset, and train CoT-Bridge to bridge thought leaps. Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques. Furthermore, CoT-Bridge demonstrates improved generalization to out-of-domain logical reasoning tasks, confirming that enhancing reasoning completeness yields broadly applicable benefits.
Chinese: 本文提出思维跳跃桥接任务,通过自动检测并补全数学思维链中缺失的推理步骤,实验证明基于桥接数据训练的模型在数学推理和逻辑任务中均获得显著性能提升与泛化能力增强。
English: This paper introduces the CoT Thought Leap Bridge Task to automatically identify and fill missing reasoning steps in mathematical Chain-of-Thought datasets, demonstrating through experiments that models trained with these bridged datasets achieve superior performance and generalization across mathematical and logical reasoning tasks.

Authors:Xiaojie Gu, Guangxu Chen, Jungang Li, Jia-Chen Gu, Xuming Hu, Kai Zhang
Title: UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Large Language Models
Abstract:
Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose UltraEdit, a fundamentally new editing solution that is training-, subject-, and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. UltraEdit performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, UltraEdit employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. UltraEdit achieves editing speeds over 7x faster than the previous state-of-the-art method (which was also the fastest known approach) while consuming less than 1/3 of the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct UltraEditBench, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that UltraEdit consistently achieves superior performance across diverse model editing scenarios. Our code is available at: https://github.com/XiaojieGu/UltraEdit.
中文: UltraEdit提出了一种高效且可扩展的终身模型编辑方法,通过单步参数调整和持续归一化策略,在显著提升编辑速度的同时降低内存消耗,使大型语言模型能在消费级硬件上实现大规模知识更新。
English: UltraEdit introduces a highly efficient and scalable lifelong model editing approach that significantly accelerates editing speeds while reducing memory usage, enabling extensive updates to large language models on consumer-grade hardware.
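The lifelong normalization strategy amounts to maintaining running feature statistics across editing turns; a generic online-moments buffer (Welford's algorithm) captures the idea. This is a sketch under that assumption, not UltraEdit's actual code.

```python
# Sketch of a lifelong normalization buffer: feature statistics are updated
# across editing turns so later edits see shift-adjusted, consistent features.
import torch

class LifelongNorm:
    def __init__(self, dim, eps=1e-5):
        self.count = 0
        self.mean = torch.zeros(dim)
        self.m2 = torch.zeros(dim)
        self.eps = eps

    def update(self, feats):  # feats: (n, dim) hidden features from this turn
        for x in feats:       # Welford's online algorithm, one sample at a time
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (x - self.mean)

    def normalize(self, feats):
        var = self.m2 / max(self.count - 1, 1)
        return (feats - self.mean) / torch.sqrt(var + self.eps)

norm = LifelongNorm(dim=4)
norm.update(torch.randn(32, 4))                    # turn 1 statistics
print(norm.normalize(torch.randn(2, 4)).shape)     # torch.Size([2, 4])
```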

Authors:Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No
Title: SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Abstract:
Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.
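The data-construction step behind this idea is simple to illustrate: harmful prompts get a short fixed primer prepended to the reasoning target, and benign prompts are left untouched. The primer text below is an illustrative stand-in, not the paper's actual 8 tokens.

```python
# Hedged sketch of SAFEPATH-style fine-tuning data construction.
SAFETY_PRIMER = "Let me think about safety first."  # illustrative, not the paper's primer

def build_example(prompt, reasoning, is_harmful):
    # Only the primer is supervised for harmful prompts; the rest of the
    # reasoning trace stays as-is, per the abstract's description.
    target = (SAFETY_PRIMER + " " + reasoning) if is_harmful else reasoning
    return {"prompt": prompt, "target": target}

print(build_example("How do I pick a lock?", "<reasoning trace>", is_harmful=True))
```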

Authors:Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin
Title: Beyond Words: Multimodal LLM Knows When to Speak
Abstract:
While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.
中文: 本研究提出MM-When2Speak多模态大语言模型,通过整合对齐的视觉、听觉和文本数据来预测最佳回应时机与类型,在对话响应时间准确性上相比现有模型提升高达四倍。
English: This study introduces MM-When2Speak, a multimodal LLM that leverages aligned visual, auditory, and textual data to enhance conversational AI by accurately predicting optimal response timing and type, achieving up to four times better timing accuracy than existing models.

Authors:Yu Ying Chiu, Zhilin Wang, Sharan Maiya, Yejin Choi, Kyle Fish, Sydney Levine, Evan Hubinger
Title: Will AI Tell Lies to Save Sick Children? Litmus-Testing AI Values Prioritization with AIRiskDilemmas
Abstract:
Detecting AI risks becomes more challenging as stronger models emerge and find novel methods, such as Alignment Faking, to circumvent detection attempts. Inspired by how risky behaviors in humans (i.e., illegal activities that may hurt others) are sometimes guided by strongly held values, we believe that identifying values within AI models can be an early warning system for AI's risky behaviors. We create LitmusValues, an evaluation pipeline to reveal AI models' priorities on a range of AI value classes. Then, we collect AIRiskDilemmas, a diverse collection of dilemmas that pit values against one another in scenarios relevant to AI safety risks such as Power Seeking. By measuring an AI model's value prioritization using its aggregate choices, we obtain a self-consistent set of predicted value priorities that uncover potential risks. We show that values in LitmusValues (including seemingly innocuous ones like Care) can predict both seen risky behaviors in AIRiskDilemmas and unseen risky behaviors in HarmBench.
中文: 随着AI模型通过“对齐伪装”等新方法规避检测,识别风险愈发困难,因此我们开发了LitmusValues评估流程,通过分析AI在价值困境中的优先选择来预测其潜在危险行为。
English: As AI models evolve with tactics like Alignment Faking, detecting risks grows more difficult, prompting the development of LitmusValues, an evaluation pipeline that identifies AI value priorities to predict risky behaviors through dilemmas and real-world benchmarks.

Authors:Fnu Mohbat, Mohammed J Zaki
Title: KERL: Knowledge-Enhanced Personalized Recipe Recommendation using Large Language Models
Abstract:
Recent advances in large language models (LLMs) and the abundance of food data have resulted in studies to improve food understanding using LLMs. Despite several recommendation systems utilizing LLMs and Knowledge Graphs (KGs), there has been limited research on integrating food related KGs with LLMs. We introduce KERL, a unified system that leverages food KGs and LLMs to provide personalized food recommendations and generates recipes with associated micro-nutritional information. Given a natural language question, KERL extracts entities, retrieves subgraphs from the KG, which are then fed into the LLM as context to select the recipes that satisfy the constraints. Next, our system generates the cooking steps and nutritional information for each recipe. To evaluate our approach, we also develop a benchmark dataset by curating recipe related questions, combined with constraints and personal preferences. Through extensive experiments, we show that our proposed KG-augmented LLM significantly outperforms existing approaches, offering a complete and coherent solution for food recommendation, recipe generation, and nutritional analysis. Our code and benchmark datasets are publicly available at https://github.com/mohbattharani/KERL.
中文摘要:KERL系统创新性地将食物知识图谱与大语言模型结合,通过个性化推荐、菜谱生成和营养分析功能,经实验验证其性能显著优于现有方法,提供了完整的饮食解决方案。
English Summary: KERL is a novel system that integrates food knowledge graphs with large language models to deliver personalized food recommendations, generate detailed recipes, and provide nutritional analysis, outperforming existing methods through comprehensive evaluation.
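The retrieve-then-prompt flow described here (extract entities, pull a KG subgraph, serialize it as context) can be sketched minimally; the toy graph, triple format, and prompt template below are assumptions for illustration, not KERL's actual schema.

```python
# Minimal sketch of a KG-augmented prompt in the spirit of KERL.
KG = {
    "tofu": [("high_in", "protein"), ("suitable_for", "vegan")],
    "peanut": [("allergen", "peanut_allergy")],
}

def retrieve_subgraph(entities):
    return [(e, r, o) for e in entities for r, o in KG.get(e, [])]

def build_prompt(question, entities):
    triples = retrieve_subgraph(entities)
    context = "\n".join(f"{s} --{r}--> {o}" for s, r, o in triples)
    return f"Knowledge:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("Suggest a vegan high-protein dish.", ["tofu"]))
```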

Authors:Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Title: TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
Abstract:
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL's success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem, false negatives, where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose TinyV, a lightweight LLM-based verifier that augments existing rule-based methods by dynamically identifying potential false negatives and recovering valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improving RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
中文摘要:本文揭示了强化学习中验证器错误否定正确模型输出的普遍问题,并提出轻量级验证器TinyV来动态识别潜在误判,从而提升模型训练效果与收敛速度。
English Summary: This paper identifies the problem of false negatives in verifiers used for reinforcement learning of large language models, where correct answers are incorrectly rejected, and proposes TinyV, a lightweight verifier that mitigates this issue to improve training efficiency and accuracy.
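The augmentation pattern is a cheap rule-based check with an LLM fallback on rejections only; the sketch below shows that control flow. `llm_judge` is a hypothetical callable standing in for the verifier model, not TinyV's actual API.

```python
# Sketch of rule-based verification with an LLM fallback to recover false negatives.
import re

def rule_based_equal(pred: str, gold: str) -> bool:
    norm = lambda s: re.sub(r"[\s$,]", "", s).lower()
    return norm(pred) == norm(gold)

def verify(pred: str, gold: str, llm_judge) -> bool:
    if rule_based_equal(pred, gold):
        return True
    # Only potential false negatives reach the (more expensive) LLM verifier
    prompt = f"Are '{pred}' and '{gold}' mathematically equivalent? Answer yes or no."
    return llm_judge(prompt) == "yes"

# "0.5" vs "1/2" fails the string rule, but an LLM judge can recover it
print(verify("0.5", "1/2", llm_judge=lambda prompt: "yes"))  # True
```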

Authors:Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken
Title: SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas
Abstract:
We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a puzzle using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-based and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. Our error analysis reveals systematic failures such as satisfiability bias, context inconsistency, and condition omission, highlighting limitations of current LLMs in search-based logical reasoning. Our code and data are publicly available at https://github.com/Anjiang-Wei/SATBench.
中文: SATBench是一个通过基于布尔可满足性问题生成的逻辑谜题来评估大语言模型逻辑推理能力的基准,揭示了模型在搜索式推理中的显著局限性和系统性错误。
English: SATBench is a benchmark that assesses the logical reasoning of large language models using SAT-derived puzzles, revealing significant limitations in search-based reasoning and systematic errors like satisfiability bias.
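A toy version of the generation step makes the pipeline concrete: sample a random 3-SAT formula and label it SAT/UNSAT (brute force is fine at this scale); the benchmark then has an LLM verbalize the clauses into a natural-language puzzle. This is an illustration of the setup, not SATBench's code.

```python
# Toy SAT instance generation and ground-truth labeling.
import itertools, random

def random_formula(num_vars=4, num_clauses=8, k=3):
    # A clause is a list of signed variable ids, e.g. [1, -3, 4]
    return [
        [random.choice([v, -v]) for v in random.sample(range(1, num_vars + 1), k)]
        for _ in range(num_clauses)
    ]

def is_satisfiable(formula, num_vars):
    for bits in itertools.product([False, True], repeat=num_vars):
        if all(any((lit > 0) == bits[abs(lit) - 1] for lit in clause) for clause in formula):
            return True
    return False

f = random_formula()
print(f, "=>", "SAT" if is_satisfiable(f, 4) else "UNSAT")
```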

Authors:Haoran Zhao, Yuchen Yan, Yongliang Shen, Haolei Xu, Wenqi Zhang, Kaitao Song, Jian Shao, Weiming Lu, Jun Xiao, Yueting Zhuang
Title: Let LLMs Break Free from Overthinking via Self-Braking Tuning
Abstract:
Large reasoning models (LRMs), such as OpenAI o1 and DeepSeek-R1, have significantly enhanced their reasoning capabilities by generating longer chains of thought, demonstrating outstanding performance across a variety of tasks. However, this performance gain comes at the cost of a substantial increase in redundant reasoning during the generation process, leading to high computational overhead and exacerbating the issue of overthinking. Although numerous existing approaches aim to address the problem of overthinking, they often rely on external interventions. In this paper, we propose a novel framework, Self-Braking Tuning (SBT), which tackles overthinking from the perspective of allowing the model to regulate its own reasoning process, thus eliminating the reliance on external control mechanisms. We construct a set of overthinking identification metrics based on standard answers and design a systematic method to detect redundant reasoning. This method accurately identifies unnecessary steps within the reasoning trajectory and generates training signals for learning self-regulation behaviors. Building on this foundation, we develop a complete strategy for constructing data with adaptive reasoning lengths and introduce an innovative braking prompt mechanism that enables the model to naturally learn when to terminate reasoning at an appropriate point. Experiments across mathematical benchmarks (AIME, AMC, MATH500, GSM8K) demonstrate that our method reduces token consumption by up to 60% while maintaining comparable accuracy to unconstrained models.
中文: 大型推理模型(如OpenAI o1和DeepSeek-R1)通过生成长推理链提升了性能,但冗余推理导致计算成本高昂;我们提出的自制动调优框架使模型能自主调控推理过程,在保持精度的同时将令牌消耗降低高达60%。
English: Large reasoning models like OpenAI o1 and DeepSeek-R1 achieve strong performance through extended reasoning chains but suffer from computational inefficiency due to redundant steps, which our Self-Braking Tuning method addresses by enabling models to self-regulate reasoning length, cutting token use by up to 60% while preserving accuracy.
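One heuristic in the spirit of the paper's overthinking identification: steps after the first point where the reference answer is already derived count as redundant, yielding both a redundancy score and a trimmed, adaptive-length target. This string-matching heuristic is an illustration, not the paper's metric.

```python
# Toy overthinking check: trim reasoning after the answer first appears.
def redundancy_ratio(steps, answer):
    for i, step in enumerate(steps):
        if answer in step:                 # answer already derived at step i
            return (len(steps) - i - 1) / len(steps), steps[: i + 1]
    return 0.0, steps

steps = ["2+3=5", "so the result is 5", "double-check: 5", "re-verify: 5"]
ratio, trimmed = redundancy_ratio(steps, "5")
print(ratio, trimmed)  # 0.75 of the trace was redundant; keep only the first step
```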

Authors:Guangzhi Xiong, Eric Xie, Corey Williams, Myles Kim, Amir Hassan Shariatmadari, Sikun Guo, Stefan Bekiranov, Aidong Zhang
Title: Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
Abstract:
Large language models (LLMs) have shown significant potential in scientific disciplines such as biomedicine, particularly in hypothesis generation, where they can analyze vast literature, identify patterns, and suggest research directions. However, a key challenge lies in evaluating the truthfulness of generated hypotheses, as verifying their accuracy often requires substantial time and resources. Additionally, the hallucination problem in LLMs can lead to the generation of hypotheses that appear plausible but are ultimately incorrect, undermining their reliability. To facilitate the systematic study of these challenges, we introduce TruthHypo, a benchmark for assessing the capabilities of LLMs in generating truthful scientific hypotheses, and KnowHD, a knowledge-based hallucination detector to evaluate how well hypotheses are grounded in existing knowledge. Our results show that LLMs struggle to generate truthful hypotheses. By analyzing hallucinations in reasoning steps, we demonstrate that the groundedness scores provided by KnowHD serve as an effective metric for filtering truthful hypotheses from the diverse outputs of LLMs. Human evaluations further validate the utility of KnowHD in identifying truthful hypotheses and accelerating scientific discovery. Our data and source code are available at https://github.com/Teddy-XiongGZ/TruthHypo.
Chinese: 大语言模型在生成科学假设方面展现出潜力,但因幻觉问题面临真实性挑战,为此开发了TruthHypo基准和KnowHD检测器,有效评估并筛选出准确假设。
English: Large language models show promise in generating scientific hypotheses but face challenges in truthfulness due to hallucinations, leading to the development of TruthHypo benchmark and KnowHD detector to evaluate and filter accurate hypotheses effectively.

Authors:Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Mingzheng Xu, Tianhao Cheng, Yixuan Wang, Zheng Chu, Shijie Xuyang, Zhiyuan Ma, YuanTao Fan, Wanxiang Che
Title: Success is in the Details: Evaluate and Enhance Details Sensitivity of Code LLMs through Counterfactuals
Abstract:
Code Sensitivity refers to the ability of Code LLMs to recognize and respond to changes in the details of a problem description. While current code benchmarks and instruction data focus on difficulty and diversity, sensitivity is overlooked. We first introduce the CTF-Code benchmark, constructed using counterfactual perturbations that minimize input changes while maximizing output changes. The evaluation shows that many LLMs suffer a performance drop of more than 10% compared to the original problems. To fully utilize sensitivity, we propose CTF-Instruct, an incremental instruction fine-tuning framework that extends existing data and uses a selection mechanism to meet the three dimensions of difficulty, diversity, and sensitivity. Experiments show that LLMs fine-tuned with CTF-Instruct data achieve over a 2% improvement on CTF-Code and a more than 10% performance boost on LiveCodeBench, validating the feasibility of enhancing LLMs' sensitivity to improve performance.
中文摘要:代码敏感性指代码大语言模型识别问题描述细微变化的能力,常被现有基准忽略;为此提出的CTF-Code基准和CTF-Instruct框架有效提升了模型敏感性,实验证明该方法显著提高了模型性能。
English Summary: Code sensitivity, the ability of Code LLMs to detect subtle changes in problem descriptions, is often overlooked in benchmarks, so the CTF-Code benchmark and CTF-Instruct framework were developed to enhance this sensitivity, resulting in significant performance improvements.

Authors:Chun-Yi Kuan, Hung-yi Lee
Title: Teaching Audio-Aware Large Language Models What Does Not Hear: Mitigating Hallucinations through Synthesized Negative Samples
Abstract:
Recent advancements in audio-aware large language models (ALLMs) enable them to process and understand audio inputs. However, these models often hallucinate non-existent sound events, reducing their reliability in real-world applications. To address this, we propose LISTEN (Learning to Identify Sounds Through Extended Negative Samples), a contrastive-like training method that enhances ALLMs' ability to distinguish between present and absent sounds using synthesized data from the backbone LLM. Unlike prior approaches, our method requires no modification to LLM parameters and efficiently integrates audio representations via a lightweight adapter. Experiments show that LISTEN effectively mitigates hallucinations while maintaining impressive performance on existing audio question and reasoning benchmarks. At the same time, it is more efficient in both data and computation.

Authors:Yuanbo Fang, Haoze Sun, Jun Liu, Tao Zhang, Zenan Zhou, Weipeng Chen, Xiaofen Xing, Xiangmin Xu
Title: S2SBench: A Benchmark for Quantifying Intelligence Degradation in Speech-to-Speech Large Language Models
Abstract:
End-to-end speech large language models (LLMs) extend the capabilities of text-based models to directly process and generate audio tokens. However, this often leads to a decline in reasoning and generation performance compared to text input, a phenomenon referred to as intelligence degradation. To systematically evaluate this gap, we propose S2SBench, a benchmark designed to quantify performance degradation in Speech LLMs. It includes diagnostic datasets targeting sentence continuation and commonsense reasoning under audio input. We further introduce a pairwise evaluation protocol based on perplexity differences between plausible and implausible samples to measure degradation relative to text input. We apply S2SBench to analyze the training process of Baichuan-Audio, which further demonstrates the benchmark's effectiveness. All datasets and evaluation code are available at https://github.com/undobug/S2SBench.
中文摘要:端到端语音大语言模型虽能直接处理音频,却存在智能退化问题;为此我们开发了S2SBench基准来量化性能差距,并通过Baichuan-Audio验证了其有效性。
English Summary: End-to-end speech LLMs enable direct audio processing but suffer from intelligence degradation, prompting the creation of S2SBench to evaluate performance gaps and analyze models like Baichuan-Audio.
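
To make the pairwise protocol concrete, here is a minimal sketch of perplexity-difference scoring. It assumes an HF-style causal LM standing in for the speech model, and the (plausible, implausible) pair below is an illustrative toy; S2SBench's own data and interfaces will differ.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def score_nll(text):
    """Mean negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Toy (plausible, implausible) continuation pair; the benchmark builds such
# pairs from its diagnostic datasets.
pairs = [("She poured coffee into a cup.", "She poured coffee into a cloud.")]

# A pair counts as correct when the plausible sample is less perplexing,
# i.e. has lower mean NLL, than the implausible one.
acc = sum(score_nll(good) < score_nll(bad) for good, bad in pairs) / len(pairs)
print(f"pairwise accuracy: {acc:.2%}")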

Authors:Yuqiao Tan, Shizhu He, Kang Liu, Jun Zhao
Title: Neural Incompatibility: The Unbridgeable Gap of Cross-Scale Parametric Knowledge Transfer in Large Language Models
Abstract:
Large Language Models (LLMs) offer a transparent brain with accessible parameters that encode extensive knowledge, which can be analyzed, located, and transferred. Consequently, a key research challenge is to transcend traditional knowledge transfer paradigms rooted in symbolic language and achieve genuine Parametric Knowledge Transfer (PKT). Significantly, exploring effective methods for transferring knowledge across LLMs of different scales through parameters presents an intriguing and valuable research direction. In this paper, we first demonstrate that Alignment in parametric space is the fundamental prerequisite for successful cross-scale PKT. We redefine the previously explored knowledge transfer as Post-Align PKT (PostPKT), which utilizes extracted parameters for LoRA initialization and requires subsequent fine-tuning for alignment. Hence, to reduce the cost of further fine-tuning, we introduce a novel Pre-Align PKT (PrePKT) paradigm and propose a solution called LaTen (Locate-Then-Align) that aligns the parametric spaces of LLMs across scales using only a few training steps and no subsequent training. Comprehensive experiments on four benchmarks demonstrate that both PostPKT and PrePKT face challenges in achieving consistently stable transfer. Through in-depth analysis, we identify Neural Incompatibility as the etiological and parametric structural differences between LLMs of varying scales, presenting fundamental challenges to achieving effective PKT. These findings provide fresh insights into the parametric architectures of LLMs and highlight promising directions for future research on efficient PKT. Our code is available at https://github.com/Trae1ounG/Neural_Incompatibility.
中文摘要:大型语言模型支持参数知识迁移,本研究提出预对齐PKT和LaTen方法以高效对齐不同规模模型的参数空间,并揭示神经不兼容性是主要挑战。
English Summary: Large Language Models enable Parametric Knowledge Transfer (PKT), where this study introduces Pre-Align PKT and LaTen to align parametric spaces across scales efficiently, revealing Neural Incompatibility as a key challenge.

Authors:Paweł Batorski, Adrian Kosmala, Paul Swoboda
Title: PRL: Prompts from Reinforcement Learning
Abstract:
Effective prompt engineering remains a central challenge in fully harnessing the capabilities of LLMs. While well-designed prompts can dramatically enhance performance, crafting them typically demands expert intuition and a nuanced understanding of the task. Moreover, the most impactful prompts often hinge on subtle semantic cues, ones that may elude human perception but are crucial for guiding LLM behavior. In this paper, we introduce PRL (Prompts from Reinforcement Learning), a novel RL-based approach for automatic prompt generation. Unlike previous methods, PRL can produce novel few-shot examples that were not seen during training. Our approach achieves state-of-the-art performance across a range of benchmarks, including text classification, simplification, and summarization. On the classification task, it surpasses APE by 2.58% and EvoPrompt by 1.00%. It also improves average ROUGE scores on the summarization task by 4.32 over APE and 2.12 over EvoPrompt, and the SARI score on simplification by 6.93 over APE and 6.01 over EvoPrompt. Our code is available at https://github.com/Batorskq/prl.
中文: 本文提出PRL,一种基于强化学习的自动提示生成方法,能够创建训练中未见过的全新少样本示例,并在文本分类、简化及摘要任务中实现了最先进的性能表现。
English: This paper introduces PRL, a reinforcement learning-based method for automatic prompt generation that creates novel few-shot examples unseen during training and achieves state-of-the-art performance across text classification, simplification, and summarization benchmarks.

Authors:Peter Baile Chen, Yi Zhang, Dan Roth, Samuel Madden, Jacob Andreas, Michael Cafarella
Title: Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation
Abstract:
While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks and apply it in future contexts. To address this limitation, we propose a novel framework, log-augmented generation (LAG), that directly reuses prior computation and reasoning from past logs at test time to enhance the model's ability to learn from previous tasks and perform better on new, unseen challenges, all while keeping the system efficient and scalable. Specifically, our system represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for only a selected subset of tokens. When a new task arises, LAG retrieves the KV values from relevant logs to augment generation. Our approach differs from reflection-based memory mechanisms by directly reusing prior reasoning and computations without requiring additional steps for knowledge extraction or distillation. Our method also goes beyond existing KV caching techniques, which primarily target efficiency gains rather than accuracy improvements. Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.
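
A rough sketch of the core reuse step follows, under the assumption that a log is a finished task's text and its KV cache is kept whole; the paper additionally stores caches for only a selected subset of tokens and retrieves across many logs, both elided here. `cache_log` and `greedy_decode` are illustrative names, not the paper's API.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def cache_log(log_text):
    """Run one forward pass over a finished task log and keep its KV cache."""
    ids = tok(log_text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, use_cache=True).past_key_values

def greedy_decode(prompt, past, steps=16):
    """Continue decoding on top of a reused KV cache, without re-encoding the log."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out_ids = []
    with torch.no_grad():
        for _ in range(steps):
            out = model(ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            ids = out.logits[:, -1:].argmax(dim=-1)  # greedy next token
            out_ids.append(ids.item())
    return tok.decode(out_ids)

past = cache_log("Task: compute 17*3. Reasoning: 17*3 = 51. Answer: 51.")
print(greedy_decode("Task: compute 17*4. Reasoning:", past))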

Authors:Sho Inoue, Shai Wang, Haizhou Li
Title: PersonaTAB: Predicting Personality Traits using Textual, Acoustic, and Behavioral Cues in Fully-Duplex Speech Dialogs
Abstract:
Despite significant progress in neural spoken dialog systems, personality-aware conversation agents -- capable of adapting behavior based on personalities -- remain underexplored due to the absence of personality annotations in speech datasets. We propose a pipeline that preprocesses raw audio recordings to create a dialogue dataset annotated with timestamps, response types, and emotion/sentiment labels. We employ an automatic speech recognition (ASR) system to extract transcripts and timestamps, then generate conversation-level annotations. Leveraging these annotations, we design a system that employs large language models to predict conversational personality. Human evaluators were engaged to identify conversational characteristics and assign personality labels. Our analysis demonstrates that the proposed system achieves stronger alignment with human judgments compared to existing approaches.
中文: 该研究通过创建带自动标注的对话数据集并利用大语言模型预测对话个性,解决了语音数据中缺乏个性标注的问题,其系统比现有方法更符合人类判断。
English: The study addresses the lack of personality annotations in speech datasets by creating a dialogue dataset with automated annotations and using large language models to predict conversational personality, achieving better alignment with human judgments than existing methods.

Authors:Jinwang Song, Hongying Zan, Kunli Zhang, Lingling Mu, Yingjie Han, Haobo Hua, Min Peng
Title: JOLT-SQL: Joint Loss Tuning of Text-to-SQL with Confusion-aware Noisy Schema Sampling
Abstract:
Text-to-SQL, which maps natural language to SQL queries, has benefited greatly from recent advances in Large Language Models (LLMs). While LLMs offer various paradigms for this task, including prompting and supervised fine-tuning (SFT), SFT approaches still face challenges such as complex multi-stage pipelines and poor robustness to noisy schema information. To address these limitations, we present JOLT-SQL, a streamlined single-stage SFT framework that jointly optimizes schema linking and SQL generation via a unified loss. JOLT-SQL employs discriminative schema linking, enhanced by local bidirectional attention, alongside a confusion-aware noisy schema sampling strategy with selective attention to improve robustness under noisy schema conditions. Experiments on the Spider and BIRD benchmarks demonstrate that JOLT-SQL achieves state-of-the-art execution accuracy among comparable-size open-source models, while significantly improving both training and inference efficiency. Our code is available at https://github.com/Songjw133/JOLT-SQL.
中文:JOLT-SQL是一种简化的单阶段监督微调框架,通过联合优化模式链接和SQL生成,在基准测试中实现了最先进的执行准确率和效率提升。
English: JOLT-SQL is a streamlined single-stage supervised fine-tuning framework that jointly optimizes schema linking and SQL generation, achieving state-of-the-art execution accuracy and improved efficiency on benchmarks.
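
As a hedged sketch of what a single-stage joint objective could look like: one backbone receives both a next-token loss on the SQL and a binary relevance loss on schema tokens. The `link_head`, the 0.5 weight, and the mask layout are illustrative assumptions, not JOLT-SQL's actual heads or hyperparameters.

import torch
import torch.nn as nn

class JointTextToSQL(nn.Module):
    def __init__(self, backbone, hidden_size):
        super().__init__()
        self.backbone = backbone                     # any HF-style causal LM
        self.link_head = nn.Linear(hidden_size, 1)   # per-token schema relevance

    def forward(self, input_ids, labels, schema_mask, link_labels):
        out = self.backbone(input_ids, labels=labels, output_hidden_states=True)
        gen_loss = out.loss                          # next-token loss on the SQL
        h = out.hidden_states[-1]                    # (batch, seq, hidden)
        logits = self.link_head(h).squeeze(-1)       # relevance score per token
        # Linking loss is computed only at schema-token positions.
        link_loss = nn.functional.binary_cross_entropy_with_logits(
            logits[schema_mask], link_labels[schema_mask].float()
        )
        return gen_loss + 0.5 * link_loss            # unified single-stage objective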

Authors:Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Praneeth Vepakomma
Title: ABBA: Highly Expressive Hadamard Product Adaptation for Large Language Models
Abstract:
Large Language Models have demonstrated strong performance across a wide range of tasks, but adapting them efficiently to new domains remains a key challenge. Parameter-Efficient Fine-Tuning (PEFT) methods address this by introducing lightweight, trainable modules while keeping most pre-trained weights fixed. The prevailing approach, LoRA, models updates using a low-rank decomposition, but its expressivity is inherently constrained by the rank. Recent methods like HiRA aim to increase expressivity by incorporating a Hadamard product with the frozen weights, but still rely on the structure of the pre-trained model. We introduce ABBA, a new PEFT architecture that reparameterizes the update as a Hadamard product of two independently learnable low-rank matrices. In contrast to prior work, ABBA fully decouples the update from the pre-trained weights, enabling both components to be optimized freely. This leads to significantly higher expressivity under the same parameter budget. We formally analyze ABBA's expressive capacity and validate its advantages through matrix reconstruction experiments. Empirically, ABBA achieves state-of-the-art results on arithmetic and commonsense reasoning benchmarks, consistently outperforming existing PEFT methods by a significant margin across multiple models. Our code is publicly available at: https://github.com/CERT-Lab/abba.
中文: ABBA提出了一种新的参数高效微调架构,通过两个独立低秩矩阵的哈达玛积实现与预训练权重的完全解耦,在相同参数预算下显著提升表达能力,并在多项推理基准测试中大幅领先现有方法。
English: ABBA introduces a novel parameter-efficient fine-tuning architecture that decouples updates from pre-trained weights using two independent low-rank matrices, achieving superior expressivity and state-of-the-art performance across reasoning benchmarks.
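
The core reparameterization is easy to state in code: the weight update is the Hadamard product of two independently learnable low-rank factors, added to a frozen base layer. A minimal sketch, with illustrative ranks and initialization rather than the repo's actual recipe:

import torch
import torch.nn as nn

class ABBALinear(nn.Module):
    def __init__(self, base: nn.Linear, r1=8, r2=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pre-trained weights stay frozen
        d_out, d_in = base.weight.shape
        self.B1 = nn.Parameter(torch.randn(d_out, r1) * 0.01)
        self.A1 = nn.Parameter(torch.randn(r1, d_in) * 0.01)
        self.B2 = nn.Parameter(torch.randn(d_out, r2) * 0.01)
        self.A2 = nn.Parameter(torch.zeros(r2, d_in))  # zero init: delta starts at 0

    def forward(self, x):
        # Hadamard product of two low-rank maps, fully decoupled from base.weight.
        delta = (self.B1 @ self.A1) * (self.B2 @ self.A2)
        return self.base(x) + x @ delta.T

layer = ABBALinear(nn.Linear(768, 768))
print(layer(torch.randn(2, 768)).shape)  # torch.Size([2, 768])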

Authors:Kaustubh Ponkshe, Shaan Shah, Raghav Singhal, Praneeth Vepakomma
Title: Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Abstract:
Large Language Models (LLMs) rely on safety alignment to produce socially acceptable responses. This is typically achieved through instruction tuning and reinforcement learning from human feedback. However, this alignment is known to be brittle: further fine-tuning, even on benign or lightly contaminated data, can degrade safety and reintroduce harmful behaviors. A growing body of work suggests that alignment may correspond to identifiable geometric directions in weight space, forming subspaces that could, in principle, be isolated or preserved to defend against misalignment. In this work, we conduct a comprehensive empirical study of this geometric perspective. We examine whether safety-relevant behavior is concentrated in specific subspaces, whether it can be separated from general-purpose learning, and whether harmfulness arises from distinguishable patterns in internal representations. Across both parameter and activation space, our findings are consistent: subspaces that amplify safe behaviors also amplify unsafe ones, and prompts with different safety implications activate overlapping representations. We find no evidence of a subspace that selectively governs safety. These results challenge the assumption that alignment is geometrically localized. Rather than residing in distinct directions, safety appears to emerge from entangled, high-impact components of the model's broader learning dynamics. This suggests that subspace-based defenses may face fundamental limitations and underscores the need for alternative strategies to preserve alignment under continued training. We corroborate these findings through multiple experiments on five open-source LLMs. Our code is publicly available at: https://github.com/CERT-Lab/safety-subspaces.
English Summary: Large language models' safety alignment is fragile and can be compromised during fine-tuning, as safety is deeply entangled with general learning components rather than isolated in distinct subspaces, limiting the effectiveness of subspace-based defenses.

Authors:Tong Bao, Heng Zhang, Chengzhi Zhang
Title: Enhancing Abstractive Summarization of Scientific Papers Using Structure Information
Abstract:
Abstractive summarization of scientific papers has long been a research focus, yet existing methods face two main challenges. First, most summarization models rely on Encoder-Decoder architectures that treat papers as sequences of words and thus fail to fully capture the structured information inherent in scientific papers. Second, existing research often uses keyword mapping or feature engineering to identify structural information, but these methods struggle with the structural flexibility of scientific papers and lack robustness across disciplines. To address these challenges, we propose a two-stage abstractive summarization framework that leverages automatic recognition of structural functions within scientific papers. In the first stage, we standardize chapter titles from numerous scientific papers and construct a large-scale dataset for structural function recognition. A classifier is then trained to automatically identify the key structural components (e.g., Background, Methods, Results, Discussion), which provides a foundation for generating more balanced summaries. In the second stage, we employ Longformer to capture rich contextual relationships across sections and generate context-aware summaries. Experiments conducted on two domain-specific scientific paper summarization datasets demonstrate that our method outperforms advanced baselines and generates more comprehensive summaries. The code and dataset can be accessed at https://github.com/tongbao96/code-for-SFR-AS.
Chinese: 本研究提出了一种两阶段摘要生成框架,通过分类器自动识别科学论文的结构功能,并利用Longformer生成上下文感知的摘要,实验表明该方法优于现有基准模型且能生成更全面的摘要内容。
English: This study introduces a two-stage abstractive summarization framework that first identifies structural functions in scientific papers using a classifier and then employs Longformer to generate context-aware summaries, outperforming existing methods and producing more comprehensive results.

Authors:Chengzhi Zhang, Xinyi Yan, Lei Zhao, Yingyi Zhang
Title: Enhancing Keyphrase Extraction from Academic Articles Using Section Structure Information
Abstract:
The exponential increase in academic papers has significantly increased the time required for researchers to access relevant literature. Keyphrase Extraction (KPE) offers a solution by enabling researchers to retrieve relevant literature efficiently. Current studies on KPE from academic articles aim to improve the performance of extraction models through innovative approaches that use the title and abstract as input corpora. However, the semantic richness of keywords is significantly constrained by the length of the abstract. While full-text-based KPE can address this issue, it simultaneously introduces noise, which significantly diminishes KPE performance. To address this, this paper utilizes the structural features and section texts obtained from the section structure information of academic articles to extract keyphrases from academic papers. The approach consists of two main parts: (1) exploring the effect of seven structural features on KPE models, and (2) integrating the extraction results from all section texts used as input corpora via a keyphrase integration algorithm to obtain the final keyphrase set. Furthermore, this paper also examines the effect of the classification quality of section structure on KPE performance. The results show that incorporating structural features improves KPE performance, though different features have varying effects on model efficacy. The keyphrase integration approach yields the best performance, and the classification quality of section structure affects KPE performance. These findings indicate that using the section structure information of academic articles contributes to effective KPE from academic articles. The code and dataset supporting this study are available at https://github.com/yan-xinyi/SSB_KPE.
中文: 本研究通过利用学术论文的结构特征和整合章节文本,改进了关键词提取方法,结果表明该方法能提升性能,尽管不同特征效果各异且依赖于章节分类质量。
English: This study enhances keyphrase extraction from academic papers by leveraging structural features and integrating section texts, demonstrating that this approach improves performance despite varying feature impacts and dependency on section classification quality.

Authors:Tianle Gu, Zongqi Wang, Kexin Huang, Yuanqi Yao, Xiangliang Zhang, Yujiu Yang, Xiuying Chen
Title: Invisible Entropy: Towards Safe and Efficient Low-Entropy LLM Watermarking
Abstract:
Logit-based LLM watermarking traces and verifies AI-generated content by maintaining green and red token lists and increasing the likelihood of green tokens during generation. However, it fails in low-entropy scenarios, where predictable outputs make green token selection difficult without disrupting natural text flow. Existing approaches address this by assuming access to the original LLM to calculate entropy and selectively watermark high-entropy tokens. However, these methods face two major challenges: (1) high computational costs and detection delays due to reliance on the original LLM, and (2) potential risks of model leakage. To address these limitations, we propose Invisible Entropy (IE), a watermarking paradigm designed to enhance both safety and efficiency. Instead of relying on the original LLM, IE introduces a lightweight feature extractor and an entropy tagger to predict whether the entropy of the next token is high or low. Furthermore, based on theoretical analysis, we develop a threshold navigator that adaptively sets entropy thresholds. It identifies a threshold where the watermark ratio decreases as the green token count increases, enhancing the naturalness of the watermarked text and improving detection robustness. Experiments on the HumanEval and MBPP datasets demonstrate that IE reduces parameter size by 99% while achieving performance on par with state-of-the-art methods. Our work introduces a safe and efficient paradigm for low-entropy watermarking. Code: https://github.com/Carol-gutianle/IE; dataset: https://huggingface.co/datasets/Carol0110/IE-Tagger
中文摘要:提出的隐形熵水印方法通过轻量级特征提取器和自适应阈值,在不依赖原始大语言模型的情况下有效解决低熵场景水印难题,参数量减少99%的同时保持优异性能。
English Summary: The proposed Invisible Entropy (IE) watermarking method overcomes limitations of existing approaches by using a lightweight feature extractor and adaptive thresholding to efficiently watermark low-entropy text without relying on the original LLM, achieving 99% parameter reduction while maintaining performance.
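
As a sketch of the overall shape, the following applies a standard green-list logit boost only when a tagger flags the next step as high-entropy. The `stub_tagger` reads entropy directly off the logits purely for illustration; in IE the tagger is a learned lightweight predictor that never queries the original LLM, and the gamma/delta values here are arbitrary.

import hashlib
import torch

GAMMA, DELTA = 0.5, 2.0  # green-list fraction and logit boost (illustrative)

def green_mask(prev_token: int, vocab_size: int) -> torch.Tensor:
    """Pseudo-random green list seeded by the previous token."""
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**31)
    g = torch.Generator().manual_seed(seed)
    return torch.rand(vocab_size, generator=g) < GAMMA

def watermark_logits(logits, prev_token, tagger):
    """Boost green tokens only when the tagger predicts a high-entropy step."""
    if not tagger(logits):          # low-entropy step: leave the distribution alone
        return logits
    mask = green_mask(prev_token, logits.shape[-1])
    return logits + DELTA * mask

def stub_tagger(logits, thresh=2.0):
    # Stand-in for IE's learned entropy tagger: here we simply compute the
    # entropy of the next-token distribution and threshold it.
    p = torch.softmax(logits, dim=-1)
    return -(p * torch.log(p + 1e-9)).sum() > thresh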

Authors:Yakun Zhu, Zhongzhen Huang, Linjie Mu, Yutong Huang, Wei Nie, Jiaji Liu, Shaoting Zhang, Pengfei Liu, Xiaofan Zhang
Title: DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models
Abstract:
The emergence of groundbreaking large language models capable of performing complex reasoning tasks holds significant promise for addressing various scientific challenges, including those arising in complex clinical scenarios. To enable their safe and effective deployment in real-world healthcare settings, it is urgently necessary to benchmark the diagnostic capabilities of current models systematically. Given the limitations of existing medical benchmarks in evaluating advanced diagnostic reasoning, we present DiagnosisArena, a comprehensive and challenging benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties and derived from clinical case reports published in 10 top-tier medical journals. The benchmark is developed through a meticulous construction pipeline involving multiple rounds of screening and review by both AI systems and human experts, with thorough checks conducted to prevent data leakage. Our study reveals that even the most advanced reasoning models, o3, o1, and DeepSeek-R1, achieve only 51.12%, 31.09%, and 17.79% accuracy, respectively. This finding highlights a significant generalization bottleneck in current large language models when faced with clinical diagnostic reasoning challenges. Through DiagnosisArena, we aim to drive further advancements in AI's diagnostic reasoning capabilities, enabling more effective solutions for real-world clinical diagnostic challenges. We provide the benchmark and evaluation tools for further research and development at https://github.com/SPIRAL-MED/DiagnosisArena.
Chinese: 突破性的大语言模型在应对临床诊断等科学挑战方面展现出潜力,但当前先进模型在专业级诊断推理上仍面临困难,这通过它们在DiagnosisArena新基准测试中的低准确率得以体现。
English: Groundbreaking large language models show promise for tackling scientific challenges like clinical diagnostics, but current advanced models still struggle with professional-level diagnostic reasoning, as shown by their low accuracy on the new DiagnosisArena benchmark.

Authors:Bao-Ngoc Dao, Quang Nguyen, Luyen Ngo Dinh, Minh Le, Nam Le, Linh Ngo Van
Title: Towards Rehearsal-Free Continual Relation Extraction: Capturing Within-Task Variance with Adaptive Prompting
Abstract:
Memory-based approaches have shown strong performance in Continual Relation Extraction (CRE). However, storing examples from previous tasks increases memory usage and raises privacy concerns. Recently, prompt-based methods have emerged as a promising alternative, as they do not rely on storing past samples. Despite this progress, current prompt-based techniques face several core challenges in CRE, particularly in accurately identifying task identities and mitigating catastrophic forgetting. Existing prompt selection strategies often suffer from inaccuracies, lack robust mechanisms to prevent forgetting in shared parameters, and struggle to handle both cross-task and within-task variations. In this paper, we propose WAVE++, a novel approach inspired by the connection between prefix-tuning and mixture of experts. Specifically, we introduce task-specific prompt pools that enhance flexibility and adaptability across diverse tasks while avoiding boundary-spanning risks; this design more effectively captures variations within each task and across tasks. To further refine relation classification, we incorporate label descriptions that provide richer, more global context, enabling the model to better distinguish among different relations. We also propose a training-free mechanism to improve task prediction during inference. Moreover, we integrate a generative model to consolidate prior knowledge within the shared parameters, thereby removing the need for explicit data storage. Extensive experiments demonstrate that WAVE++ outperforms state-of-the-art prompt-based and rehearsal-based methods, offering a more robust solution for continual relation extraction. Our code is publicly available at https://github.com/PiDinosauR2804/WAVE-CRE-PLUS-PLUS.
中文:WAVE++通过任务特定的提示池和标签描述,提升了持续关系抽取的适应性和分类能力,无需存储历史数据即可超越现有方法。
English: WAVE++ introduces task-specific prompt pools and label descriptions to enhance adaptability and classification in continual relation extraction, outperforming existing methods without storing past data.

Authors:Saydul Akbar Murad, Ashim Dahal, Nick Rahimi
Title: EEG-to-Text Translation: A Model for Deciphering Human Brain Activity
Abstract:
With the rapid advancement of large language models like Gemini, GPT, and others, bridging the gap between the human brain and language processing has become an important area of focus. To address this challenge, researchers have developed various models to decode EEG signals into text. However, these models still face significant performance limitations. To overcome these shortcomings, we propose a new model, R1 Translator, which aims to improve the performance of EEG-to-text decoding. The R1 Translator model combines a bidirectional LSTM encoder with a pretrained transformer-based decoder, utilizing EEG features to produce high-quality text outputs. The model processes EEG embeddings through the LSTM to capture sequential dependencies, which are then fed into the transformer decoder for effective text generation. The R1 Translator excels in ROUGE metrics, outperforming both T5 (previous research) and Brain Translator. Specifically, R1 achieves a ROUGE-1 precision of 38.00%, which is up to 9% higher than T5 (34.89%) and 3% better than Brain (35.69%). It also leads in ROUGE-L, with an F1 score of 32.51%, outperforming T5 by 3% (29.67%) and Brain by 2% (30.38%). R1 achieves a CER of 0.5795, which is 2% lower than T5 (0.5917) and 4% lower than Brain (0.6001). Additionally, R1 performs better in WER with a score of 0.7280, outperforming T5 by 4.3% (0.7610) and Brain by 3.6% (0.7553). Code is available at https://github.com/Mmurrad/EEG-To-text.
Chinese: R1 Translator模型结合双向LSTM编码器和预训练Transformer解码器,在脑电信号转文本任务中显著提升性能,各项ROUGE指标、CER和WER均优于现有模型。
English: The R1 Translator model, integrating a bidirectional LSTM encoder with a transformer-based decoder, significantly enhances EEG-to-text decoding by outperforming existing models in ROUGE metrics, CER, and WER.
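
A rough architectural sketch of this encoder-decoder pairing, assuming BART as the pretrained seq2seq decoder and 840-dimensional EEG feature vectors; both are assumptions for illustration, and the paper's exact decoder and feature dimensions may differ.

import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration

class EEGToText(nn.Module):
    def __init__(self, eeg_dim=840, hidden=512):
        super().__init__()
        self.bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
        d_model = self.bart.config.d_model
        # BiLSTM captures sequential dependencies in the EEG feature stream.
        self.lstm = nn.LSTM(eeg_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, d_model)  # map BiLSTM states into BART space

    def forward(self, eeg_feats, labels):
        h, _ = self.lstm(eeg_feats)                 # (batch, seq, 2*hidden)
        enc = self.proj(h)
        # The projected states feed the transformer as encoder input embeddings.
        return self.bart(inputs_embeds=enc, labels=labels).loss

model = EEGToText()
loss = model(torch.randn(2, 56, 840), labels=torch.randint(0, 50000, (2, 12)))
print(loss.item())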

Authors:Yanheng He, Jiahe Jin, Pengfei Liu
Title: Efficient Agent Training for Computer Use
Abstract:
Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further improved data quality by synthesizing diverse action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, surpassing the strong Claude 3.7 Sonnet with extended thinking on WindowsAgentArena-V2, an improved benchmark we also released. Furthermore, PC Agent-E demonstrates strong generalizability to different operating systems on OSWorld. Our findings suggest that strong computer use capabilities can be stimulated from a small amount of high-quality trajectory data.
中文: PC Agent-E 是一种高效的智能体训练框架,仅通过312条人工标注轨迹并结合合成动作决策,显著降低了对大规模人类示范数据的依赖,在基准测试中实现了141%的性能提升,并展现出强大的跨平台泛化能力。
English: PC Agent-E is an efficient training framework that significantly reduces the need for large-scale human demonstrations by using just 312 annotated trajectories enhanced with synthesized actions, achieving a 141% improvement on benchmarks and demonstrating strong cross-platform generalization.

Authors:Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang, Zimo Meng, Zhengyang Zhao, Bohan Zeng, Zhengzhou Zhu, Bin Cui, Wentao Zhang
Title: Let's Verify Math Questions Step by Step
Abstract:
Large Language Models (LLMs) have recently achieved remarkable progress in mathematical reasoning. To enable such capabilities, many existing works distill strong reasoning models into long chains of thought or design algorithms to construct high-quality math QA data for training. However, these efforts primarily focus on generating correct reasoning paths and answers, while largely overlooking the validity of the questions themselves. In this work, we propose Math Question Verification (MathQ-Verify), a novel five-stage pipeline designed to rigorously filter ill-posed or under-specified math problems. MathQ-Verify first performs format-level validation to remove redundant instructions and ensure that each question is syntactically well-formed. It then formalizes each question, decomposes it into atomic conditions, and verifies them against mathematical definitions. Next, it detects logical contradictions among these conditions, followed by a goal-oriented completeness check to ensure the question provides sufficient information for solving. To evaluate this task, we use existing benchmarks along with an additional dataset we construct, containing 2,147 math questions with diverse error types, each manually double-validated. Experiments show that MathQ-Verify achieves state-of-the-art performance across multiple benchmarks, improving the F1 score by up to 25 percentage points over the direct verification baseline. It further attains approximately 90% precision and 63% recall through a lightweight model voting scheme. MathQ-Verify offers a scalable and accurate solution for curating reliable mathematical datasets, reducing label noise and avoiding unnecessary computation on invalid questions. Our code and data are available at https://github.com/scuuy/MathQ-Verify.
Chinese: 本文提出的MathQ-Verify通过五个阶段的流程,有效过滤定义不清的数学问题,包括验证格式、分解条件、检测矛盾及确保完整性,实现了最优性能并提升了数据集的可靠性。
English: This paper introduces MathQ-Verify, a five-stage pipeline that effectively filters ill-posed math problems by validating format, decomposing conditions, detecting contradictions, and ensuring completeness, achieving state-of-the-art performance and enhancing dataset reliability.
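
Schematically, the pipeline is a short-circuiting chain of checks: a question is kept only if every stage passes. The stage prompts and the `ask_llm` judge below are placeholders for illustration, not the paper's actual prompts or models.

STAGES = [
    "Stage 1: Is the question syntactically well-formed, free of redundant instructions?",
    "Stage 2: After formalizing the question, is every atomic condition mathematically valid?",
    "Stage 3: Are the atomic conditions mutually consistent (no logical contradictions)?",
    "Stage 4: Does the question provide sufficient information to reach its stated goal?",
    "Stage 5: Overall, is this a well-posed math question?",
]

def ask_llm(prompt):
    # Placeholder judge: wire an actual model call in here. It must return
    # True when the stage passes and False otherwise.
    return True

def verify_question(question):
    """A question survives only if it passes every stage, in order."""
    return all(ask_llm(f"{stage}\n\nQuestion: {question}") for stage in STAGES)

print(verify_question("If x + 3 = 7 and x + 3 = 9, what is x?"))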

Authors:Jiwon Song, Dongwon Jo, Yulhwa Kim, Jae-Joon Kim
Title: Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning
Abstract:
Recent reasoning-focused language models achieve high accuracy by generating lengthy intermediate reasoning paths before producing final answers. While this approach is effective for problems that require logical thinking, long reasoning paths significantly increase memory usage and reduce token-generation throughput, limiting the practical deployment of such models. We propose Reasoning Path Compression (RPC), a training-free method that accelerates inference by leveraging the semantic sparsity of reasoning paths. RPC periodically compresses the KV cache by retaining only the entries that receive high importance scores, which are computed using a selector window composed of recently generated queries. Experiments show that RPC improves the generation throughput of QwQ-32B by up to 1.60× compared to inference with the full KV cache, with an accuracy drop of only 1.2% on the AIME 2024 benchmark. Our findings demonstrate that semantic sparsity in reasoning traces can be effectively exploited for compression, offering a practical path toward efficient deployment of reasoning LLMs. Our code is available at https://github.com/jiwonsong-dev/ReasoningPathCompression.
Chinese: 近期专注于推理的语言模型虽能实现高准确率,但因冗长的推理路径导致内存使用增加和生成吞吐量下降,为此我们提出推理路径压缩(RPC)这一无需训练的方法,通过基于语义稀疏性压缩KV缓存来加速推理,在AIME 2024基准测试中使QwQ-32B的生成吞吐量提升高达1.60倍,而准确率仅下降1.2%。
English: Recent reasoning-focused language models achieve high accuracy but suffer from increased memory usage and reduced throughput due to lengthy reasoning paths, prompting the introduction of Reasoning Path Compression (RPC), a training-free method that accelerates inference by compressing the KV cache based on semantic sparsity, improving generation throughput by up to 1.60× with minimal accuracy loss.
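
A minimal single-head sketch of the compression step, assuming importance is the mean attention that a window of recent queries pays to each cached position; here the most recent keys double as the selector window purely for illustration, and the real method operates per head inside a full model.

import torch

def compress_kv(K, V, Q_recent, keep):
    """K, V: (seq, d) cached keys/values; Q_recent: (w, d) selector window."""
    d = K.shape[-1]
    attn = torch.softmax(Q_recent @ K.T / d**0.5, dim=-1)  # (w, seq)
    importance = attn.mean(dim=0)                          # average over the window
    # Keep the top-`keep` positions, preserving their original order.
    idx = importance.topk(min(keep, K.shape[0])).indices.sort().values
    return K[idx], V[idx]

K, V = torch.randn(1000, 64), torch.randn(1000, 64)
K2, V2 = compress_kv(K, V, Q_recent=K[-32:], keep=256)
print(K2.shape)  # torch.Size([256, 64])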

Authors:Dan Ofer, Michal Linial, Dafna Shahaf
Title: InterFeat: A Pipeline for Finding Interesting Scientific Features
Abstract:
Finding interesting phenomena is the core of scientific discovery, but identifying them is a manual, ill-defined process. We present an integrative pipeline for automating the discovery of interesting simple hypotheses (feature-target relations with effect direction and a potential underlying mechanism) in structured biomedical data. The pipeline combines machine learning, knowledge graphs, literature search, and Large Language Models. We formalize "interestingness" as a combination of novelty, utility, and plausibility. On 8 major diseases from the UK Biobank, our pipeline consistently recovers risk factors years before their appearance in the literature. 40-53% of our top candidates were validated as interesting, compared to 0-7% for a SHAP-based baseline. Overall, 28% of 109 candidates were interesting to medical experts. The pipeline addresses the challenge of operationalizing "interestingness" scalably and for any target. We release data and code: https://github.com/LinialLab/InterFeat
中文: 该研究开发了一种自动化流程,融合机器学习、知识图谱和大语言模型,从生物医学数据中发掘新颖、实用且合理的新假设,能提前数年识别风险因素,并获得远高于基线方法的高专家验证率。
English: This study introduces an automated pipeline that combines machine learning, knowledge graphs, and large language models to discover novel, useful, and plausible hypotheses in biomedical data, successfully identifying risk factors years ahead of literature and achieving high expert validation rates.

Authors:Khanh-Tung Tran, Barry O'Sullivan, Hoang D. Nguyen
Title: IRLBench: A Multi-modal, Culturally Grounded, Parallel Irish-English Benchmark for Open-Ended LLM Reasoning Evaluation
Abstract:
Recent advances in Large Language Models (LLMs) have demonstrated promising knowledge and reasoning abilities, yet their performance in multilingual and low-resource settings remains underexplored. Existing benchmarks often exhibit cultural bias, restrict evaluation to text-only formats, rely on multiple-choice questions, and, more importantly, are limited for extremely low-resource languages. To address these gaps, we introduce IRLBench, presented in parallel English and Irish, a language UNESCO classifies as definitely endangered. Our benchmark consists of 12 representative subjects developed from the 2024 Irish Leaving Certificate exams, enabling fine-grained analysis of model capabilities across domains. By framing the task as long-form generation and leveraging the official marking scheme, it supports a comprehensive evaluation not only of correctness but also of language fidelity. Our extensive experiments on leading closed-source and open-source LLMs reveal a persistent performance gap between English and Irish: models produce valid Irish responses less than 80% of the time and answer correctly 55.8% of the time, compared to 76.2% in English for the best-performing model. We release IRLBench (https://huggingface.co/datasets/ReliableAI/IRLBench) and an accompanying evaluation codebase (https://github.com/ReML-AI/IRLBench) to enable future research on robust, culturally aware multilingual AI development.
中文:IRLBench是基于爱尔兰毕业考试开发的双语评测基准,通过长文本生成任务评估大语言模型在英语和爱尔兰语中的表现,揭示了显著性能差距,旨在推动具有文化意识的多语言人工智能研究发展。
English: IRLBench is a new bilingual benchmark developed from Irish Leaving Certificate exams to evaluate LLMs' long-form generation in both English and Irish, revealing significant performance gaps and promoting culturally aware multilingual AI research.

Authors:Xingyuan Lu, Yuxi Liu, Dongyu Zhang, Zhiyao Wu, Jing Ren, Feng Xia
Title: EmoMeta: A Multimodal Dataset for Fine-grained Emotion Classification in Chinese Metaphors
Abstract:
Metaphors play a pivotal role in expressing emotions, making them crucial for emotional intelligence. The advent of multimodal data and widespread communication has led to a proliferation of multimodal metaphors, amplifying the complexity of emotion classification compared to single-mode scenarios. However, the scarcity of research on constructing multimodal metaphorical fine-grained emotion datasets hampers progress in this domain. Moreover, existing studies predominantly focus on English, overlooking potential variations in emotional nuances across languages. To address these gaps, we introduce a multimodal dataset in Chinese comprising 5,000 text-image pairs of metaphorical advertisements. Each entry is meticulously annotated for metaphor occurrence, domain relations and fine-grained emotion classification encompassing joy, love, trust, fear, sadness, disgust, anger, surprise, anticipation, and neutral. Our dataset is publicly accessible (https://github.com/DUTIR-YSQ/EmoMeta), facilitating further advancements in this burgeoning field.
中文摘要:隐喻对情感智能至关重要,但多模态数据集匮乏且研究多限于英语,为此我们构建了包含5000个标注文本-图像对的中文多模态隐喻广告数据集,以推动该领域发展。
English Summary: Metaphors are essential for emotional intelligence, but the lack of multimodal datasets and cross-linguistic studies hinders progress, so we created a publicly available Chinese dataset of 5,000 annotated text-image metaphorical advertisements to advance this field.

Authors:Avinash Patil, Siru Tao, Amardeep Gedhu
Title: Evaluating Reasoning LLMs for Suicide Screening with the Columbia-Suicide Severity Rating Scale
Abstract:
Suicide prevention remains a critical public health challenge. While online platforms such as Reddit's r/SuicideWatch have historically provided spaces for individuals to express suicidal thoughts and seek community support, the advent of large language models (LLMs) introduces a new paradigm-where individuals may begin disclosing ideation to AI systems instead of humans. This study evaluates the capability of LLMs to perform automated suicide risk assessment using the Columbia-Suicide Severity Rating Scale (C-SSRS). We assess the zero-shot performance of six models-including Claude, GPT, Mistral, and LLaMA-in classifying posts across a 7-point severity scale (Levels 0-6). Results indicate that Claude and GPT closely align with human annotations, while Mistral achieves the lowest ordinal prediction error. Most models exhibit ordinal sensitivity, with misclassifications typically occurring between adjacent severity levels. We further analyze confusion patterns, misclassification sources, and ethical considerations, underscoring the importance of human oversight, transparency, and cautious deployment. Full code and supplementary materials are available at https://github.com/av9ash/llm_cssrs_code.
中文摘要:本研究评估了六种大型语言模型使用C-SSRS量表进行自杀风险评估的能力,发现Claude和GPT与人工标注最为接近,同时强调了部署过程中的伦理考量。
English Summary: This study evaluates six large language models for automated suicide risk assessment using the C-SSRS scale, finding that Claude and GPT align best with human ratings while highlighting ethical considerations for deployment.

Authors:Xiaoyuan Liu, Tian Liang, Zhiwei He, Jiahao Xu, Wenxuan Wang, Pinjia He, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
Abstract:
Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism involves leveraging verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model's problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path toward developing more robust and self-aware reasoners.
中文: RISE作为一种在线强化学习框架,通过结果验证的整合反馈机制,同步提升大语言模型的解题能力与自我核查技能。
English: RISE is an online reinforcement learning framework that trains large language models to simultaneously enhance problem-solving accuracy and self-verification skills through integrated feedback from outcome verification.

Authors:Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li
Title: AdaptThink: Reasoning Models Can Learn When to Think
Abstract:
Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.
中文:AdaptThink是一种新颖的强化学习算法,能根据任务难度自适应选择思考与无思考模式,在多种推理任务中显著降低推理成本的同时提升性能表现。
English: AdaptThink is a novel reinforcement learning algorithm that adaptively selects between thinking and no-thinking modes based on task difficulty, significantly reducing inference costs while improving performance across various reasoning tasks.

Authors:David Anugraha, Zilu Tang, Lester James V. Miranda, Hanyang Zhao, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Wijaya, Genta Indra Winata
Title: R3: Robust Rubric-Agnostic Reward Models
Abstract:
Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3.
中文: 提出的R3框架采用了一种与评分标准无关、可推广的奖励建模方法,通过提供可解释的评分分配来增强语言模型与多样化人类偏好对齐的透明度和灵活性。
English: The proposed R3 framework introduces a rubric-agnostic, generalizable reward modeling approach that provides interpretable score assignments to enhance transparency and flexibility in aligning language models with diverse human preferences.

Authors:Nam V. Nguyen, Huy Nguyen, Quang Pham, Van Nguyen, Savitha Ramasamy, Nhat Ho
Title: CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition
Abstract:
Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process, in which the experts that perform the computation do not directly inform the routing decision. In this work, we propose competition, a novel mechanism to route tokens to the experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy, thus achieving strong performance at low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
Chinese: 提出的CompeteSMoE算法引入了一种竞争机制,将令牌路由至具有最高神经响应的专家,相比现有SMoE策略,在训练大型语言模型时展现出更优的样本效率和性能,同时保持较低训练开销。
English: The proposed CompeteSMoE algorithm introduces a competition mechanism that routes tokens to experts with the highest neural response, demonstrating superior sample efficiency and performance in training large language models with low overhead compared to existing SMoE strategies.
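
A hedged sketch of the competition mechanism itself: every expert computes its response and the token is routed to the experts with the largest response norms. The scheduled training of a cheap router to imitate this competition, which is what keeps CompeteSMoE's overhead low at inference, is omitted here.

import torch
import torch.nn as nn

class CompetitionMoE(nn.Module):
    def __init__(self, d=256, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                                        # x: (tokens, d)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, d)
        response = outs.norm(dim=-1)                             # neural response per expert
        top = response.topk(self.k, dim=1).indices               # winners per token
        gate = torch.zeros_like(response).scatter(1, top, 1.0 / self.k)
        return (gate.unsqueeze(-1) * outs).sum(dim=1)            # mix only the winners

moe = CompetitionMoE()
print(moe(torch.randn(10, 256)).shape)  # torch.Size([10, 256])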

Authors:Gongfan Fang, Xinyin Ma, Xinchao Wang
Title: Thinkless: LLM Learns When to Think
Abstract:
Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning to all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model's ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, <short> for concise responses and <think> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO. Empirically, on several benchmarks such as Minerva Algebra, MATH-500, and GSM8K, Thinkless reduces the usage of long-chain thinking by 50%-90%, significantly improving the efficiency of Reasoning Language Models. The code is available at https://github.com/VainF/Thinkless.
中文: Thinkless框架通过强化学习让语言模型自适应选择简短或复杂推理链,在多个基准测试中将冗余的长链推理减少50%-90%,显著提升了推理效率。
English: The Thinkless framework enables LLMs to adaptively choose between short and long reasoning chains using reinforcement learning, significantly reducing unnecessary complex reasoning by 50%-90% while maintaining performance across multiple benchmarks.
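
The decoupling is simple to express: the policy-gradient term on the control token is weighted separately from the term on the answer tokens. A schematic sketch, where `alpha` and the tensor layout are assumptions rather than the paper's settings:

import torch

def degrpo_loss(logp_control, logp_response, advantage, alpha=0.1):
    """logp_control: (B,) log-prob of the chosen <short>/<think> token.
    logp_response: (B,) summed log-prob of the answer tokens.
    advantage: (B,) group-relative advantage of each sampled trajectory."""
    control_term = -(advantage * logp_control).mean()    # governs mode selection
    response_term = -(advantage * logp_response).mean()  # improves the answers
    return alpha * control_term + response_term          # decoupled weighting

loss = degrpo_loss(
    logp_control=torch.randn(4), logp_response=torch.randn(4),
    advantage=torch.tensor([0.5, -0.2, 1.0, -1.3]),
)
print(loss.item())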

Authors:Qiguang Chen, Libo Qin, Jinhao Liu, Yue Liao, Jiaqi Wang, Jingxuan Zhou, Wanxiang Che
Title: RBF++: Quantifying and Optimizing Reasoning Boundaries across Measurable and Unmeasurable Capabilities for Chain-of-Thought Reasoning
Abstract:
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models (LLMs) on complex tasks, spurring research into its underlying mechanisms. However, two primary challenges remain for real-world applications: (1) the lack of quantitative metrics and actionable guidelines for evaluating and optimizing measurable boundaries of CoT capability, and (2) the absence of methods to assess boundaries of unmeasurable CoT capability, such as multimodal perception. To address these gaps, we introduce the Reasoning Boundary Framework++ (RBF++). To tackle the first challenge, we define the reasoning boundary (RB) as the maximum limit of CoT performance. We also propose a combination law for RBs, enabling quantitative analysis and offering actionable guidance across various CoT tasks. For the second challenge, particularly in multimodal scenarios, we introduce a constant assumption, which replaces unmeasurable RBs with scenario-specific constants. Additionally, we propose the reasoning boundary division mechanism, which divides unmeasurable RBs into two sub-boundaries, facilitating the quantification and optimization of both unmeasurable domain knowledge and multimodal perception capabilities. Extensive experiments involving 38 models across 13 tasks validate the feasibility of our framework in cross-modal settings. Additionally, we evaluate 10 CoT strategies, offer insights into optimization and decay from two complementary perspectives, and expand evaluation benchmarks for measuring RBs in LLM reasoning. We hope this work advances the understanding of RBs and optimization strategies in LLMs. Code and data are available at https://github.com/LightChen233/reasoning-boundary.
Chinese: 为解决思维链推理在评估和优化中的挑战,提出了推理边界框架++(RBF++),通过定义可测量的性能极限和引入处理不可测量能力(如多模态感知)的创新机制,在多种任务和模型中得到验证。
English: The Reasoning Boundary Framework++ (RBF++) is introduced to address challenges in evaluating and optimizing Chain-of-Thought (CoT) reasoning by defining measurable performance limits and handling unmeasurable capabilities like multimodal perception through innovative mechanisms, validated across diverse tasks and models.

Authors:Alice Plebe, Timothy Douglas, Diana Riazi, R. Maria del Rio-Chanona
Title: I'll believe it when I see it: Images increase misinformation sharing in Vision-Language Models
Abstract:
Large language models are increasingly integrated into news recommendation systems, raising concerns about their role in spreading misinformation. In humans, visual content is known to boost credibility and shareability of information, yet its effect on vision-language models (VLMs) remains unclear. We present the first study examining how images influence VLMs' propensity to reshare news content, whether this effect varies across model families, and how persona conditioning and content attributes modulate this behavior. To support this analysis, we introduce two methodological contributions: a jailbreaking-inspired prompting strategy that elicits resharing decisions from VLMs while simulating users with antisocial traits and political alignments; and a multimodal dataset of fact-checked political news from PolitiFact, paired with corresponding images and ground-truth veracity labels. Experiments across model families reveal that image presence increases resharing rates by 4.8% for true news and 15.0% for false news. Persona conditioning further modulates this effect: Dark Triad traits amplify resharing of false news, whereas Republican-aligned profiles exhibit reduced veracity sensitivity. Of all the tested models, only Claude-3-Haiku demonstrates robustness to visual misinformation. These findings highlight emerging risks in multimodal model behavior and motivate the development of tailored evaluation frameworks and mitigation strategies for personalized AI systems. Code and dataset are available at: https://github.com/3lis/misinfo_vlm
中文: 研究表明,图像会显著增加视觉语言模型对虚假新闻的转发,其增幅远超真实新闻,且人格特质与政治倾向会进一步影响该行为,凸显了多模态AI系统的风险并亟需针对性防护措施。
English: This study reveals that images significantly increase vision-language models' resharing of false news more than true news, with persona traits and political alignments further influencing this behavior, highlighting risks in multimodal AI systems and underscoring the need for targeted safeguards.

Authors:Lei Sheng, Shuai-Shuai Xu
Title: CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning
Abstract:
Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD private test set, our 7B model achieves 71.72% execution accuracy, while the 32B model achieves 73.67%. The code has been open-sourced at https://github.com/CycloneBoy/csc_sql.
中文: CSC-SQL方法融合自我一致性和自我修正技术,通过选择高频输出进行修订并采用强化学习微调模型,在BIRD基准测试中实现了超过71%的执行准确率。
English: The CSC-SQL method combines Self-Consistency and Self-Correction techniques to improve SQL generation accuracy by selecting top outputs for revision and fine-tuning models with reinforcement learning, achieving over 71% execution accuracy on benchmarks.
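
The selection step reduces to a few lines: sample candidates in parallel, keep the two most frequent, and let a revision model merge them. `generate_sql` and `revise` below stand in for the paper's fine-tuned generation and merge-revision models.

from collections import Counter

def csc_sql(question, schema, generate_sql, revise, n_samples=16):
    """Corrective self-consistency: top-2 vote, then merge revision."""
    candidates = [generate_sql(question, schema) for _ in range(n_samples)]
    top2 = [sql for sql, _ in Counter(candidates).most_common(2)]
    # The merge revision model sees both high-vote candidates and may return
    # either one unchanged or a corrected query.
    return revise(question, schema, top2)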

Authors:Tianshi Zheng, Zheye Deng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Zihao Wang, Yangqiu Song
Title: From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery
Abstract:
Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy-Tool, Analyst, and Scientist-to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement. Github Repository: https://github.com/HKUST-KnowComp/Awesome-LLM-Scientific-Discovery.
中文摘要:大语言模型正通过从特定任务工具演变为自主智能体,重塑科研流程与人机协作,本综述构建了分类体系并展望了未来发展路径,以推动人工智能驱动的科学发现。
English Summary: Large Language Models are transforming scientific discovery by evolving from task-specific tools into autonomous agents, redefining research processes and human-AI collaboration, with this survey providing a taxonomy and strategic foresight for future advancements.

Authors:Jieying Xue, Phuong Minh Nguyen, Minh Le Nguyen, Xin Liu
Title: JNLP at SemEval-2025 Task 11: Cross-Lingual Multi-Label Emotion Detection Using Generative Models
Abstract:
With the rapid advancement of global digitalization, users from different countries increasingly rely on social media for information exchange. In this context, multilingual multi-label emotion detection has emerged as a critical research area. This study addresses SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. Our paper focuses on two sub-tracks of this task: (1) Track A: Multi-label emotion detection, and (2) Track B: Emotion intensity. To tackle multilingual challenges, we leverage pre-trained multilingual models and focus on two architectures: (1) a fine-tuned BERT-based classification model and (2) an instruction-tuned generative LLM. Additionally, we propose two methods for handling multi-label classification: the base method, which maps an input directly to all its corresponding emotion labels, and the pairwise method, which models the relationship between the input text and each emotion category individually. Experimental results demonstrate the strong generalization ability of our approach in multilingual emotion recognition. In Track A, our method achieved Top 4 performance across 10 languages, ranking 1st in Hindi. In Track B, our approach also secured Top 5 performance in 7 languages, highlighting its simplicity and effectiveness. Our code is available at https://github.com/yingjie7/mlingual_multilabel_emo_detection.
中文摘要:本研究通过采用预训练模型和创新分类方法应对多语言多标签情绪检测挑战,在SemEval-2025任务11中于多语言环境下取得领先性能。
English Summary: This study tackles multilingual multi-label emotion detection by employing pre-trained models and innovative classification methods, achieving top performance across multiple languages in SemEval-2025 Task 11.

Authors:Chih-Kai Yang, Neo Ho, Yen-Ting Piao, Hung-yi Lee
Title: SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information
Abstract:
Large audio-language models (LALMs) extend large language models with multimodal understanding in speech, audio, etc. While their performance on speech and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored. Particularly, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.
中文摘要:该研究提出SAKURA基准测试,发现大型音频语言模型即使能正确提取语音/音频信息,仍难以进行多跳推理,揭示了多模态整合的关键缺陷。
English Summary: The study introduces SAKURA, a benchmark revealing that large audio-language models struggle with multi-hop reasoning despite correctly extracting speech/audio information, exposing a critical limitation in multimodal integration.

Authors:Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang
Title: Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
Abstract:
We introduce SLED, an alternative approach to speech language modeling by encoding speech waveforms into sequences of continuous latent representations and modeling them autoregressively using an energy distance objective. The energy distance offers an analytical measure of the distributional gap by contrasting simulated and target samples, enabling efficient training to capture the underlying continuous autoregressive distribution. By bypassing reliance on residual vector quantization, SLED avoids discretization errors and eliminates the need for the complicated hierarchical architectures common in existing speech language models. It simplifies the overall modeling pipeline while preserving the richness of speech information and maintaining inference efficiency. Empirical results demonstrate that SLED achieves strong performance in both zero-shot and streaming speech synthesis, showing its potential for broader applications in general-purpose speech language models.
中文摘要:SLED是一种创新的语音语言模型,采用连续潜在表示和能量距离目标进行高效自回归训练,无需量化或复杂层级结构,在零样本和流式语音合成中均表现出色。
English Summary: SLED is a novel speech language model that uses continuous latent representations and an energy distance objective for efficient autoregressive training, eliminating the need for quantization and complex hierarchies while achieving strong performance in zero-shot and streaming synthesis.
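
For intuition about the objective, a generic PyTorch estimator of the energy distance between simulated and target latent samples is sketched below; the paper's exact estimator and training setup may differ:

```python
import torch

def energy_distance(sim: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    """Generic energy-distance estimator: 2*E||X-Y|| - E||X-X'|| - E||Y-Y'||.

    sim: (n, d) simulated latents; tgt: (m, d) target latents.
    """
    d_xy = torch.cdist(sim, tgt).mean()
    d_xx = torch.cdist(sim, sim).mean()
    d_yy = torch.cdist(tgt, tgt).mean()
    return 2 * d_xy - d_xx - d_yy
```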

Authors:Zihao Cheng, Hongru Wang, Zeming Liu, Yuhang Guo, Yuanfang Guo, Yunhong Wang, Haifeng Wang
Title: ToolSpectrum : Towards Personalized Tool Utilization for Large Language Models
Abstract:
While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions, overlooking context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs' capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool utilization. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool utilization significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit only a limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations of current models. Our data and code are available at https://github.com/Chengziha0/ToolSpectrum.
中文摘要:本文提出ToolSpectrum基准,通过评估用户画像和环境因素对工具选择的影响,解决增强型大语言模型缺乏情境感知个性化的问题,研究表明尽管个性化能提升用户体验,现有模型仍难以平衡这两个维度。
English Summary: This paper introduces ToolSpectrum, a benchmark addressing the lack of context-aware personalization in tool-augmented LLMs by evaluating how user profiles and environmental factors impact tool selection, revealing that while personalization improves user experience, current models struggle to balance these dimensions effectively.

Authors:Yassine El Boudouri, Walter Nuninger, Julian Alvarez, Yvan Peter
Title: Role-Playing Evaluation for Large Language Models
Abstract:
Large Language Models (LLMs) demonstrate a notable capacity for adopting personas and engaging in role-playing. However, evaluating this ability presents significant challenges, as human assessments are resource-intensive and automated evaluations can be biased. To address this, we introduce Role-Playing Eval (RPEval), a novel benchmark designed to assess LLM role-playing capabilities across four key dimensions: emotional understanding, decision-making, moral alignment, and in-character consistency. This article details the construction of RPEval and presents baseline evaluations. Our code and dataset are available at https://github.com/yelboudouri/RPEval
中文摘要:作者提出角色扮演评估(RPEval)这一新基准,通过情感理解、决策能力、道德对齐和角色一致性四个维度评估大语言模型的角色扮演能力,以解决现有评估方法的不足。
English Summary: The authors introduce Role-Playing Eval (RPEval), a novel benchmark to assess LLM role-playing capabilities across emotional understanding, decision-making, moral alignment, and in-character consistency, addressing limitations in current evaluation methods.

Authors:Ziyang Ma, Yinghao Ma, Yanqiao Zhu, Chen Yang, Yi-Wen Chao, Ruiyang Xu, Wenxi Chen, Yuanzhe Chen, Zhuo Chen, Jian Cong, Kai Li, Keliang Li, Siyou Li, Xinfeng Li, Xiquan Li, Zheng Lian, Yuzhe Liang, Minghao Liu, Zhikang Niu, Tianrui Wang, Yuping Wang, Yuxuan Wang, Yihao Wu, Guanrou Yang, Jianwei Yu, Ruibin Yuan, Zhisheng Zheng, Ziya Zhou, Haina Zhu, Wei Xue, Emmanouil Benetos, Kai Yu, Eng-Siong Chng, Xie Chen
Title: MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix
Abstract:
We introduce MMAR, a new benchmark designed to evaluate the deep reasoning capabilities of Audio-Language Models (ALMs) across massive multi-disciplinary tasks. MMAR comprises 1,000 meticulously curated audio-question-answer triplets, collected from real-world internet videos and refined through iterative error corrections and quality checks to ensure high quality. Unlike existing benchmarks that are limited to specific domains of sound, music, or speech, MMAR extends them to a broad spectrum of real-world audio scenarios, including mixed-modality combinations of sound, music, and speech. Each question in MMAR is hierarchically categorized across four reasoning layers: Signal, Perception, Semantic, and Cultural, with additional sub-categories within each layer to reflect task diversity and complexity. To further foster research in this area, we annotate every question with a Chain-of-Thought (CoT) rationale to promote future advancements in audio reasoning. Each item in the benchmark demands multi-step deep reasoning beyond surface-level understanding. Moreover, a part of the questions requires graduate-level perceptual and domain-specific knowledge, elevating the benchmark's difficulty and depth. We evaluate MMAR using a broad set of models, including Large Audio-Language Models (LALMs), Large Audio Reasoning Models (LARMs), Omni Language Models (OLMs), Large Language Models (LLMs), and Large Reasoning Models (LRMs), with audio caption inputs. The performance of these models on MMAR highlights the benchmark's challenging nature, and our analysis further reveals critical limitations of understanding and reasoning capabilities among current models. We hope MMAR will serve as a catalyst for future advances in this important but little-explored area.
中文: MMAR是一个新颖的基准,用于评估音频语言模型的深度推理能力,包含1000个高质量音频-问题-答案三元组,涵盖广泛真实场景和四个层次化推理层级,并标注思维链原理以推动音频推理研究发展。
English: MMAR is a novel benchmark for evaluating deep reasoning in Audio-Language Models, featuring 1,000 high-quality audio-question-answer triplets spanning diverse real-world scenarios and four hierarchical reasoning layers, with annotated Chain-of-Thought rationales to advance audio reasoning research.

Authors:Yuhao Qing, Boyu Zhu, Mingzhe Du, Zhijiang Guo, Terry Yue Zhuo, Qianru Zhang, Jie M. Zhang, Heming Cui, Siu-Ming Yiu, Dong Huang, See-Kiong Ng, Luu Anh Tuan
Title: EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
Abstract:
Existing code generation benchmarks primarily evaluate functional correctness, with limited focus on code efficiency and often restricted to a single language like Python. To address this gap, we introduce EffiBench-X, the first multi-language benchmark designed to measure the efficiency of LLM-generated code. EffiBench-X supports Python, C++, Java, JavaScript, Ruby, and Golang. It comprises competitive programming tasks with human-expert solutions as efficiency baselines. Evaluating state-of-the-art LLMs on EffiBench-X reveals that while models generate functionally correct code, they consistently underperform human experts in efficiency. Even the most efficient LLM-generated solutions (Qwen3-32B) achieve only around 62% of human efficiency on average, with significant language-specific variations. LLMs show better efficiency in Python, Ruby, and JavaScript than in Java, C++, and Golang. For instance, DeepSeek-R1's Python code is significantly more efficient than its Java code. These results highlight the critical need for research into LLM optimization techniques to improve code efficiency across diverse languages. The dataset and evaluation infrastructure are available at https://github.com/EffiBench/EffiBench-X.git and https://huggingface.co/datasets/EffiBench/effibench-x.
中文:EffiBench-X是首个多语言代码效率评测基准,覆盖六种编程语言,研究发现大模型生成的代码虽功能正确但效率显著低于人类专家,平均仅达人类效率的62%,且存在明显的语言差异性。
English: EffiBench-X is the first multi-language benchmark to evaluate code efficiency across six programming languages, revealing that LLMs produce functionally correct but significantly less efficient code than human experts, achieving only 62% of human efficiency on average with notable variations between languages.

Authors:Baohao Liao, Hanze Dong, Yuhui Xu, Doyen Sahoo, Christof Monz, Junnan Li, Caiming Xiong
Title: Fractured Chain-of-Thought Reasoning
Abstract:
Inference-time scaling techniques have significantly bolstered the reasoning capabilities of large language models (LLMs) by harnessing additional computational effort at inference without retraining. Similarly, Chain-of-Thought (CoT) prompting and its extension, Long CoT, improve accuracy by generating rich intermediate reasoning trajectories, but these approaches incur substantial token costs that impede their deployment in latency-sensitive settings. In this work, we first show that truncated CoT, which stops reasoning before completion and directly generates the final answer, often matches full CoT sampling while using dramatically fewer tokens. Building on this insight, we introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling along three orthogonal axes: (1) the number of reasoning trajectories, (2) the number of final solutions per trajectory, and (3) the depth at which reasoning traces are truncated. Through extensive experiments on five diverse reasoning benchmarks and several model scales, we demonstrate that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget. Our analysis reveals how to allocate computation across these dimensions to maximize performance, paving the way for more efficient and scalable LLM reasoning. Code is available at https://github.com/BaohaoLiao/frac-cot.
中文:Fractured Sampling是一种创新的推理时策略,通过调整推理轨迹、截断深度和解决方案数量,在无需重新训练大语言模型的情况下,优化了推理准确性与计算成本之间的平衡,实现了更高的效率。
English: Fractured Sampling is a new inference-time strategy that optimizes the trade-off between reasoning accuracy and computational cost by adjusting reasoning trajectories, truncation depth, and solution counts, achieving superior efficiency without retraining LLMs.
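
A schematic of the three sampling axes follows; `gen_trace` and `gen_answer` are placeholder LLM calls, and truncating by a fixed fraction of steps is an illustrative assumption:

```python
from collections import Counter

def fractured_sampling(gen_trace, gen_answer, prompt,
                       n_traj=4, depths=(0.25, 0.5, 1.0), m_sol=2):
    """Sample reasoning traces, truncate at several depths, draw several answers."""
    answers = []
    for _ in range(n_traj):                              # axis 1: trajectories
        trace = gen_trace(prompt)                        # full CoT as a list of steps
        for frac in depths:                              # axis 3: truncation depth
            partial = trace[: max(1, int(len(trace) * frac))]
            for _ in range(m_sol):                       # axis 2: solutions per trace
                answers.append(gen_answer(prompt, partial))
    return Counter(answers).most_common(1)[0][0]         # majority-vote aggregation
```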

Authors:Shanshan Liu, Noriki Nishida, Rumana Ferdous Munne, Narumi Tokunaga, Yuki Yamagata, Kouji Kozaki, Yuji Matsumoto
Title: MA-COIR: Leveraging Semantic Search Index and Generative Models for Ontology-Driven Biomedical Concept Recognition
Abstract:
Recognizing biomedical concepts in text is vital for ontology refinement, knowledge graph construction, and concept relationship discovery. However, traditional concept recognition methods, which rely on explicit mention identification, often fail to capture complex concepts not explicitly stated in the text. To overcome this limitation, we introduce MA-COIR, a framework that reformulates concept recognition as an indexing-recognition task. By assigning semantic search indexes (ssIDs) to concepts, MA-COIR resolves ambiguities in ontology entries and enhances recognition efficiency. Using a pretrained BART-based model fine-tuned on small datasets, our approach reduces computational requirements to facilitate adoption by domain experts. Furthermore, we incorporate queries and synthetic data generated by large language models (LLMs) to improve recognition in low-resource settings. Experimental results on three scenarios (CDR, HPO, and HOIP) highlight the effectiveness of MA-COIR in recognizing both explicit and implicit concepts without the need for mention-level annotations during inference, advancing ontology-driven concept recognition in biomedical domain applications. Our code and constructed data are available at https://github.com/sl-633/macoir-master.
中文摘要:MA-COIR框架通过将生物医学概念识别重构为索引任务,利用语义搜索标识符有效识别显性和隐性概念,无需在推理过程中进行提及级标注,显著提升了本体驱动概念识别的性能。
English Summary: MA-COIR is a novel framework that reframes biomedical concept recognition as an indexing task using semantic search identifiers, enabling efficient identification of both explicit and implicit concepts without requiring mention-level annotations during inference.

Authors:Zhihe Yang, Xufang Luo, Zilong Wang, Dongqi Han, Zhiyuan He, Dongsheng Li, Yunjian Xu
Title: Do Not Let Low-Probability Tokens Over-Dominate in RL for LLMs
Abstract:
Reinforcement learning (RL) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs), with recent innovations such as Group Relative Policy Optimization (GRPO) demonstrating exceptional effectiveness. In this study, we identify a critical yet underexplored issue in RL training: low-probability tokens disproportionately influence model updates due to their large gradient magnitudes. This dominance hinders the effective learning of high-probability tokens, whose gradients are essential for LLMs' performance but are substantially suppressed. To mitigate this interference, we propose two novel methods: Advantage Reweighting and Low-Probability Token Isolation (Lopti), both of which effectively attenuate gradients from low-probability tokens while emphasizing parameter updates driven by high-probability tokens. Our approaches promote balanced updates across tokens with varying probabilities, thereby enhancing the efficiency of RL training. Experimental results demonstrate that they substantially improve the performance of GRPO-trained LLMs, achieving up to a 46.2% improvement in K&K Logic Puzzle reasoning tasks. Our implementation is available at https://github.com/zhyang2226/AR-Lopti.
中文摘要:本研究针对强化学习中低概率词元梯度干扰问题,提出了优势重加权和Lopti两种方法,有效平衡不同概率词元的参数更新,使大语言模型在推理任务中的性能提升高达46.2%。
English Summary: This research addresses the imbalance in reinforcement learning for large language models by introducing Advantage Reweighting and Lopti methods, which reduce gradient interference from low-probability tokens and enhance learning from high-probability ones, achieving up to 46.2% improvement in reasoning tasks.
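
A hedged sketch of the Advantage Reweighting idea: scale each token's contribution by a function of its probability so low-probability tokens stop dominating the update. The linear weighting rule and `alpha` below are assumptions for illustration:

```python
import torch

def reweighted_pg_loss(logprobs: torch.Tensor,
                       advantages: torch.Tensor,
                       alpha: float = 0.3) -> torch.Tensor:
    """logprobs: (T,) per-token log-probs under the current policy;
    advantages: (T,) per-token advantages (e.g., from GRPO)."""
    probs = logprobs.exp().detach()            # stop-grad: weights, not signal
    weights = alpha * probs + (1.0 - alpha)    # high-prob tokens weighted up
    return -(weights * advantages * logprobs).mean()
```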

Authors:Chenxi Liu, Yongqiang Chen, Tongliang Liu, James Cheng, Bo Han, Kun Zhang
Title: On the Thinking-Language Modeling Gap in Large Language Models
Abstract:
System 2 reasoning is one of the defining characteristics of intelligence, which requires slow and logical thinking. Humans conduct System 2 reasoning via the language of thoughts, which organizes the reasoning process as a causal sequence of mental language, or thoughts. Recently, it has been observed that System 2 reasoning can be elicited from Large Language Models (LLMs) pre-trained on large-scale natural language. However, in this work, we show that there is a significant gap between the modeling of languages and thoughts. Because language is primarily a tool for humans to share knowledge and thinking, modeling human language can easily absorb language biases into LLMs that deviate from the chains of thought in human minds. Furthermore, we show that these biases mislead the elicitation of "thoughts" in LLMs to focus on only a biased part of the premise. To this end, we propose a new prompting technique termed Language-of-Thoughts (LoT) to demonstrate and alleviate this gap. Instead of directly eliciting the chain of thoughts from partial information, LoT instructs LLMs to adjust the order and tokens used for the expressions of all the relevant information. We show that this simple strategy significantly reduces the language modeling biases in LLMs and improves the performance of LLMs across a variety of reasoning tasks.
中文: 本研究揭示了大型语言模型在语言与思维建模之间存在偏差,提出“思维语言”提示技术以减少语言模型偏见,从而提升各类推理任务的性能。
English: This study reveals that large language models (LLMs) exhibit a gap between language and thought modeling, leading to biased reasoning, and proposes a Language-of-Thoughts (LoT) prompting technique to reduce biases and enhance performance across reasoning tasks.

Authors:Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang
Title: TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios
Abstract:
Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on both reasoning and non-reasoning models, perform an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarize the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME, and the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME.
中文: TIME基准通过包含38,522个问答对的三个现实世界数据集,解决了大语言模型在密集时序信息、快速变化事件和复杂时序依赖等时序推理方面的不足,实验分析揭示了不同场景下的性能规律与规模扩展的影响。
English: The TIME benchmark addresses gaps in temporal reasoning for LLMs by providing 38,522 QA pairs across three real-world datasets to evaluate challenges like dense temporal data and complex event dynamics, with experimental analysis revealing performance trends and scaling effects.

Authors:Zheng Wu, Pengzhou Cheng, Zongru Wu, Lingzhong Dong, Zhuosheng Zhang
Title: GEM: Gaussian Embedding Modeling for Out-of-Distribution Detection in GUI Agents
Abstract:
Graphical user interface (GUI) agents have recently emerged as an intriguing paradigm for human-computer interaction, capable of automatically executing user instructions to operate intelligent terminal devices. However, when encountering out-of-distribution (OOD) instructions that violate environmental constraints or exceed the current capabilities of agents, GUI agents may suffer task breakdowns or even pose security threats. Therefore, effective OOD detection for GUI agents is essential. Traditional OOD detection methods perform suboptimally in this domain due to the complex embedding space and evolving GUI environments. In this work, we observe that the in-distribution input semantic space of GUI agents exhibits a clustering pattern with respect to the distance from the centroid. Based on this finding, we propose GEM, a novel method based on fitting a Gaussian mixture model over input embedding distances extracted from the GUI agent that reflect its capability boundary. Evaluated on eight datasets spanning smartphones, computers, and web browsers, our method achieves an average accuracy improvement of 23.70% over the best-performing baseline while only increasing training time by 4.9% and testing time by 6.5%. We also experimentally demonstrate that GEM can improve the step-wise success rate by 9.40% by requesting assistance from the cloud model when encountering OOD samples. Analysis verifies the generalization ability of our method through experiments on nine different backbones. The code is available at https://github.com/Wuzheng02/GEM-OODforGUIagents.
中文: 图形用户界面(GUI)代理在处理超出分布(OOD)指令时易出现任务失败或安全威胁,而提出的GEM方法通过高斯混合模型有效提升检测精度和步骤成功率,适用于多种设备环境。
English: GUI agents face challenges with out-of-distribution (OOD) instructions, leading to task failures or security risks, but the proposed GEM method significantly improves OOD detection accuracy and step-wise success rates across various devices.
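
A rough sketch of the distance-based test, using scikit-learn's GaussianMixture over embedding-to-centroid distances; the component count and likelihood cutoff are assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gem(train_emb: np.ndarray, n_components: int = 3):
    """Fit a GMM over distances from in-distribution embeddings to their centroid."""
    centroid = train_emb.mean(axis=0)
    dists = np.linalg.norm(train_emb - centroid, axis=1, keepdims=True)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(dists)
    threshold = np.quantile(gmm.score_samples(dists), 0.01)  # assumed cutoff
    return centroid, gmm, threshold

def is_ood(emb: np.ndarray, centroid, gmm, threshold) -> bool:
    d = np.array([[np.linalg.norm(emb - centroid)]])
    return gmm.score_samples(d)[0] < threshold   # low likelihood -> flag as OOD
```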

Authors:Zifeng Cheng, Zhonghui Wang, Yuchen Fu, Zhiwei Jiang, Yafeng Yin, Cong Wang, Qing Gu
Title: Contrastive Prompting Enhances Sentence Embeddings in LLMs through Inference-Time Steering
Abstract:
Extracting sentence embeddings from large language models (LLMs) is a practical direction, as it requires neither additional data nor fine-tuning. Previous studies usually focus on prompt engineering to guide LLMs to encode the core semantic information of the sentence into the embedding of the last token. However, the last token in these methods still encodes an excess of non-essential information, such as stop words, limiting its encoding capacity. To this end, we propose a Contrastive Prompting (CP) method that introduces an extra auxiliary prompt to elicit better sentence embedding. By contrasting with the auxiliary prompt, CP can steer existing prompts to encode the core semantics of the sentence, rather than non-essential information. CP is a plug-and-play inference-time intervention method that can be combined with various prompt-based methods. Extensive experiments on Semantic Textual Similarity (STS) tasks and downstream classification tasks demonstrate that our method can improve the performance of existing prompt-based methods across different LLMs. Our code will be released at https://github.com/zifengcheng/CP.
中文: 提出的对比提示方法通过引入辅助提示,使大语言模型在生成句子嵌入时聚焦核心语义,无需微调即可提升多项任务性能。
English: The proposed Contrastive Prompting (CP) method enhances sentence embeddings from LLMs by using an auxiliary prompt to focus on core semantics, improving performance across various tasks without fine-tuning.
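
One plausible reading of the inference-time steering is sketched below: contrast the last-token hidden state under the main prompt with the one under the auxiliary prompt. The combination rule and `alpha` are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def contrastive_embedding(h_main: torch.Tensor,
                          h_aux: torch.Tensor,
                          alpha: float = 1.0) -> torch.Tensor:
    """h_main / h_aux: (d,) last-token hidden states under the two prompts.
    Push the embedding away from what the auxiliary prompt encodes."""
    steered = h_main + alpha * (h_main - h_aux)
    return F.normalize(steered, dim=-1)
```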

Authors:Yanbin Yin, Kun Zhou, Zhen Wang, Xiangdong Zhang, Yifei Shao, Shibo Hao, Yi Gu, Jieyuan Liu, Somanshu Singla, Tianyang Liu, Eric P. Xing, Zhengzhong Liu, Haojian Jin, Zhiting Hu
Title: Decentralized Arena: Towards Democratic and Scalable Automatic Evaluation of Language Models
Abstract:
The recent explosion of large language models (LLMs), each with its own general or specialized strengths, makes scalable, reliable benchmarking more urgent than ever. Standard practices nowadays face fundamental trade-offs: closed-ended question-based benchmarks (e.g., MMLU) struggle with saturation as newer models emerge, while crowd-sourced leaderboards (e.g., Chatbot Arena) rely on costly and slow human judges. Recently, automated methods (e.g., LLM-as-a-judge) shed light on scalability, but risk bias by relying on one or a few "authority" models. To tackle these issues, we propose Decentralized Arena (dearena), a fully automated framework leveraging collective intelligence from all LLMs to evaluate each other. It mitigates single-model judge bias through democratic, pairwise evaluation, and remains efficient at scale through two key components: (1) a coarse-to-fine ranking algorithm for fast incremental insertion of new models with sub-quadratic complexity, and (2) an automatic question selection strategy for the construction of new evaluation dimensions. In extensive experiments across 66 LLMs, dearena attains up to 97% correlation with human judgements while significantly reducing cost. Our code and data will be publicly released at https://github.com/maitrix-org/de-arena.
Chinese Summary: 去中心化竞技场(dearena)是一个完全自动化的框架,利用所有大语言模型的集体智慧进行相互评估,通过高效的排名算法和问题选择策略,在显著降低成本的同时,实现了与人类判断高达97%的相关性。
English Summary: The Decentralized Arena (dearena) is a fully automated framework that leverages collective intelligence from all large language models (LLMs) to evaluate each other democratically, achieving up to 97% correlation with human judgments while significantly reducing costs through efficient ranking algorithms and question selection strategies.
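
The coarse-to-fine insertion can be pictured as a binary search over the current ranking, which keeps comparisons per new model logarithmic; `beats` stands in for the democratic pairwise judgment aggregated over the other LLMs, a simplification of the paper's full algorithm:

```python
def insert_model(ranking: list, new_model, beats) -> None:
    """Insert new_model into a ranked list (best first) with O(log n) comparisons.

    beats(a, b) -> True if a should rank above b, e.g., decided by a
    majority vote of the remaining LLMs acting as judges."""
    lo, hi = 0, len(ranking)
    while lo < hi:
        mid = (lo + hi) // 2
        if beats(new_model, ranking[mid]):
            hi = mid          # new model ranks above the probe
        else:
            lo = mid + 1      # new model ranks below the probe
    ranking.insert(lo, new_model)
```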

Authors:Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu
Title: A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
Abstract:
Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens, while using only 20B tokens and achieving over 1,000x training efficiency. Our code and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.
Chinese: Low-Rank Clone (LRC)方法通过低秩投影实现软剪枝和激活克隆,高效训练小型语言模型,仅用200亿标记就达到顶尖性能,训练效率提升千倍以上。
English: The Low-Rank Clone (LRC) method efficiently trains Small Language Models by using low-rank projections to achieve soft pruning and activation cloning from teacher models, achieving state-of-the-art performance with only 20 billion tokens and over 1,000x training efficiency.
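
A toy sketch of the two coupled ideas under assumed shapes: low-rank projections compress a teacher weight into a student weight (soft pruning), and an alignment loss clones teacher activations, FFN signals included; none of this is the paper's exact parameterization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankClone(nn.Module):
    """Toy LRC-style projections; shapes and init scale are assumptions."""
    def __init__(self, d_teacher: int, d_student: int):
        super().__init__()
        self.P_in = nn.Parameter(torch.randn(d_student, d_teacher) * 0.02)
        self.P_out = nn.Parameter(torch.randn(d_teacher, d_student) * 0.02)

    def student_weight(self, W_teacher: torch.Tensor) -> torch.Tensor:
        # (d_s, d_t) @ (d_t, d_t) @ (d_t, d_s) -> (d_s, d_s): soft pruning
        return self.P_in @ W_teacher @ self.P_out

def activation_clone_loss(h_student, h_teacher, P_up):
    """Lift student activations to teacher width and match the teacher.
    h_student: (B, d_s); h_teacher: (B, d_t); P_up: (d_t, d_s)."""
    return F.mse_loss(h_student @ P_up.T, h_teacher)
```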

Authors:Han Meng, Yancan Chen, Yunan Li, Yitian Yang, Jungup Lee, Renwen Zhang, Yi-Chieh Lee
Title: What is Stigma Attributed to? A Theory-Grounded, Expert-Annotated Interview Corpus for Demystifying Mental-Health Stigma
Abstract:
Mental-health stigma remains a pervasive social problem that hampers treatment-seeking and recovery. Existing resources for training neural models to finely classify such stigma are limited, relying primarily on social-media or synthetic data without theoretical underpinnings. To remedy this gap, we present an expert-annotated, theory-informed corpus of human-chatbot interviews, comprising 4,141 snippets from 684 participants with documented socio-cultural backgrounds. Our experiments benchmark state-of-the-art neural models and empirically unpack the challenges of stigma detection. This dataset can facilitate research on computationally detecting, neutralizing, and counteracting mental-health stigma. Our corpus is openly available at https://github.com/HanMeng2004/Mental-Health-Stigma-Interview-Corpus.
中文摘要:本研究提供了一个专家标注的人机对话访谈语料库,旨在解决心理健康污名检测中缺乏理论指导数据的问题,通过基准测试评估了先进神经模型并揭示了检测难点。
English Summary: This study introduces an expert-annotated corpus of human-chatbot interviews to address the lack of theory-informed data for training neural models in detecting mental-health stigma, benchmarking state-of-the-art models and highlighting detection challenges.

Authors:Taiqiang Wu, Runming Yang, Jiayi Li, Pengfei Hu, Ngai Wong, Yujiu Yang
Title: Shadow-FT: Tuning Instruct via Base
Abstract:
Large language models (LLMs) consistently benefit from further fine-tuning on various tasks. However, we observe that directly tuning the INSTRUCT (i.e., instruction-tuned) models often leads to marginal improvements and even performance degeneration. Notably, the paired BASE models, the foundation for these INSTRUCT variants, contain highly similar weight values (differing by less than 2% on average for Llama 3.1 8B). Therefore, we propose a novel Shadow-FT framework to tune the INSTRUCT models by leveraging the corresponding BASE models. The key insight is to fine-tune the BASE model and then directly graft the learned weight updates to the INSTRUCT model. Our proposed Shadow-FT introduces no additional parameters, is easy to implement, and significantly improves performance. We conduct extensive experiments on tuning mainstream LLMs, such as the Qwen 3 and Llama 3 series, and evaluate them across 19 benchmarks covering coding, reasoning, and mathematical tasks. Experimental results demonstrate that Shadow-FT consistently outperforms conventional full-parameter and parameter-efficient tuning approaches. Further analyses indicate that Shadow-FT can be applied to multimodal large language models (MLLMs) and combined with direct preference optimization (DPO). Code and weights are available at https://github.com/wutaiqiang/Shadow-FT.
中文摘要:Shadow-FT框架通过微调基础模型并直接移植权重更新至指令调优模型,无需额外参数即可在多项基准测试中显著提升性能。
English Summary: The Shadow-FT framework enhances instruction-tuned models by fine-tuning their base counterparts and transferring the learned weight updates, achieving superior performance without additional parameters across various benchmarks.
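
The grafting step itself is mechanically simple; a minimal sketch, assuming the BASE and INSTRUCT checkpoints share parameter names as in paired Hugging Face releases:

```python
import torch

@torch.no_grad()
def graft_updates(base_sd: dict, tuned_base_sd: dict, instruct_sd: dict) -> dict:
    """Add the BASE fine-tuning delta onto the INSTRUCT weights."""
    grafted = {}
    for name, w_instruct in instruct_sd.items():
        delta = tuned_base_sd[name] - base_sd[name]   # update learned on BASE
        grafted[name] = w_instruct + delta            # transplanted onto INSTRUCT
    return grafted
```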

Authors:Wenqiao Zhu, Chao Xu, Lulu Wang, Jun Wu
Title: PSC: Extending Context Window of Large Language Models via Phase Shift Calibration
Abstract:
Rotary Position Embedding (RoPE) is an efficient position encoding approach widely utilized in numerous large language models (LLMs). Recently, many methods have been proposed to further expand the context window based on RoPE. The core concept of these methods is to predefine or search for a set of factors to rescale the base frequencies of RoPE. Nevertheless, it is quite a challenge for existing methods to predefine an optimal factor due to the exponential search space. In view of this, we introduce PSC (Phase Shift Calibration), a small module for calibrating the frequencies predefined by existing methods. With PSC, we demonstrate that many existing methods can be further enhanced, such as PI, YaRN, and LongRoPE. We conducted extensive experiments across multiple models and tasks. The results demonstrate that (1) when PSC is enabled, the comparative reductions in perplexity increase as the context window size grows from 16k to 32k and up to 64k, and (2) our approach is broadly applicable and exhibits robustness across a variety of models and tasks. The code can be found at https://github.com/WNQzhu/PSC.
中文: PSC(相位偏移校准)是一种新颖模块,通过校准预定义的RoPE频率来增强现有扩展大语言模型上下文窗口的方法,在多种模型和任务中提升了性能和鲁棒性。
English: PSC (Phase Shift Calibration) is a novel module that enhances existing methods for extending the context window in large language models by calibrating predefined RoPE frequencies, improving performance and robustness across various models and tasks.
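
To place the module, the sketch below shows standard RoPE inverse frequencies with the usual rescaling factors, plus an additive per-dimension phase term in the spirit of PSC; where exactly the calibration enters is an assumption here:

```python
import torch

def rope_angles(positions, dim, base=10000.0, factors=None, phase=None):
    """positions: (T,) token positions; returns (T, dim//2) rotation angles."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    if factors is not None:              # rescaling as in PI / YaRN / LongRoPE
        inv_freq = inv_freq / factors
    angles = torch.outer(positions.float(), inv_freq)
    if phase is not None:                # assumed PSC-style additive phase shift
        angles = angles + phase
    return angles
```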

Authors:Yang Hu, Xingyu Zhang, Xueji Fang, Zhiyang Chen, Xiao Wang, Huatian Zhang, Guojun Qi
Title: SLOT: Sample-specific Language Model Optimization at Test-time
Abstract:
We propose SLOT (Sample-specific Language Model Optimization at Test-time), a novel and parameter-efficient test-time inference approach that enhances a language model's ability to respond more accurately to individual prompts. Existing Large Language Models (LLMs) often struggle with complex instructions, leading to poor performance on prompts not well represented among general samples. To address this, SLOT conducts a few optimization steps at test time to update a lightweight, sample-specific parameter vector. The vector is added to the final hidden layer before the output head, and caching the last-layer features during per-sample optimization makes the adaptation efficient. By minimizing the cross-entropy loss on the input prompt only, SLOT helps the model better align with and follow each given instruction. In experiments, we demonstrate that our method outperforms the compared models across multiple benchmarks and LLMs. For example, Qwen2.5-7B with SLOT achieves an accuracy gain of 8.6% on GSM8K, from 57.54% to 66.19%, while DeepSeek-R1-Distill-Llama-70B with SLOT achieves a SOTA accuracy of 68.69% on GPQA among 70B-level models. Our code is available at https://github.com/maple-research-lab/SLOT.
中文: SLOT是一种参数高效的测试时优化方法,通过更新轻量级参数向量来提升语言模型对单个提示的响应准确性,在多个基准测试中实现了显著的性能提升。
English: SLOT is a parameter-efficient test-time optimization method that enhances language models' accuracy on individual prompts by updating a lightweight parameter vector, achieving significant performance gains across multiple benchmarks.
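
A hedged sketch of the test-time loop: optimize one additive vector on the cached final hidden states by minimizing next-token cross-entropy on the prompt. The optimizer, step count, and learning rate are assumptions:

```python
import torch
import torch.nn.functional as F

def slot_delta(last_hidden, lm_head, input_ids, steps=3, lr=1e-2):
    """last_hidden: (T, d) cached prompt features (model frozen);
    lm_head: frozen output projection; input_ids: (T,) prompt token ids."""
    delta = torch.zeros(last_hidden.size(-1), requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)
    for _ in range(steps):
        logits = lm_head(last_hidden + delta)                # (T, vocab)
        loss = F.cross_entropy(logits[:-1], input_ids[1:])   # CE on the prompt only
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()   # also added to hidden states during generation
```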

Authors:Qizhou Chen, Dakan Wang, Taolin Zhang, Zaoming Yan, Chengsong You, Chengyu Wang, Xiaofeng He
Title: UniEdit: A Unified Knowledge Editing Benchmark for Large Language Models
Abstract:
Model editing aims to enhance the accuracy and reliability of large language models (LLMs) by efficiently adjusting their internal parameters. Currently, most LLM editing datasets are confined to narrow knowledge domains and cover a limited range of editing evaluation. They often overlook the broad scope of editing demands and the diversity of ripple effects resulting from edits. In this context, we introduce UniEdit, a unified benchmark for LLM editing grounded in open-domain knowledge. First, we construct editing samples by selecting entities from 25 common domains across five major categories, utilizing the extensive triple knowledge available in open-domain knowledge graphs to ensure comprehensive coverage of the knowledge domains. To address the issues of generality and locality in editing, we design a Neighborhood Multi-hop Chain Sampling (NMCS) algorithm to sample subgraphs based on a given knowledge piece, capturing comprehensive ripple effects for evaluation. Finally, we employ proprietary LLMs to convert the sampled knowledge subgraphs into natural language text, guaranteeing grammatical accuracy and syntactical diversity. Extensive statistical analysis confirms the scale, comprehensiveness, and diversity of our UniEdit benchmark. We conduct comprehensive experiments across multiple LLMs and editors, analyzing their performance to highlight strengths and weaknesses in editing across open knowledge domains and various evaluation criteria, thereby offering valuable insights for future research endeavors.
中文: UniEdit是一个基于开放领域知识的统一基准,通过涵盖广泛知识领域并采用创新的邻域多跳链采样算法来评估编辑的连锁效应,以提升大语言模型编辑的全面性和多样性。
English: UniEdit is a unified benchmark designed to improve large language model editing by providing comprehensive coverage across diverse open-domain knowledge and evaluating the ripple effects of edits through a novel sampling algorithm.
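
A toy multi-hop neighborhood sampler conveys the flavor of NMCS, though not the paper's exact algorithm; `graph` maps an entity to its outgoing (relation, tail) edges:

```python
import random

def sample_neighborhood(graph, seed_entity, hops=2, k=3, seed=0):
    """Walk up to `hops` steps from seed_entity, keeping at most k triples per node."""
    rng = random.Random(seed)
    triples, frontier = [], [seed_entity]
    for _ in range(hops):
        next_frontier = []
        for ent in frontier:
            edges = graph.get(ent, [])
            for rel, tail in rng.sample(edges, min(k, len(edges))):
                triples.append((ent, rel, tail))   # ripple-effect candidates
                next_frontier.append(tail)
        frontier = next_frontier
    return triples
```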

Authors:Xinye Li, Mingqi Wan, Dianbo Sui
Title: LLMSR@XLLM25: An Empirical Study of LLM for Structural Reasoning
Abstract:
We present Team asdfo123's submission to the LLMSR@XLLM25 shared task, which evaluates large language models on producing fine-grained, controllable, and interpretable reasoning processes. Systems must extract all problem conditions, decompose a chain of thought into statement-evidence pairs, and verify the logical validity of each pair. Leveraging only the off-the-shelf Meta-Llama-3-8B-Instruct, we craft a concise few-shot, multi-turn prompt that first enumerates all conditions and then guides the model to label, cite, and adjudicate every reasoning step. A lightweight post-processor based on regular expressions normalises spans and enforces the official JSON schema. Without fine-tuning, external retrieval, or ensembling, our method ranks 5th overall, achieving macro F1 scores on par with substantially more complex and resource-consuming pipelines. We conclude by analysing the strengths and limitations of our approach and outlining directions for future research in structural reasoning with LLMs. Our code is available at https://github.com/asdfo123/LLMSR-asdfo123.
中文: asdfo123团队在LLMSR@XLLM25任务中,仅使用Meta-Llama-3-8B-Instruct模型通过多轮提示实现了无需微调的结构化推理,最终排名第五,与更复杂的系统性能相当。
English: Team asdfo123's submission for the LLMSR@XLLM25 task uses Meta-Llama-3-8B-Instruct with a multi-turn prompt to achieve competitive results in structural reasoning without fine-tuning or external resources, ranking 5th overall.

Authors:Maoyuan Ye, Jing Zhang, Juhua Liu, Bo Du, Dacheng Tao
Title: LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
Abstract:
Recent advances in Large Multimodal Models (LMMs) have significantly improved their reasoning and Optical Character Recognition (OCR) capabilities. However, their performance on complex logical reasoning tasks involving text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 1,100 multiple-choice questions designed to evaluate LMMs' logical reasoning abilities on text-rich images, while minimizing reliance on domain-specific knowledge (e.g., mathematics). We construct LogicOCR by curating a text corpus from the Chinese National Civil Servant Examination and develop a scalable, automated pipeline to convert it into multimodal samples. First, we design prompt templates to steer GPT-Image-1 to generate images with diverse backgrounds, interleaved text-illustration layouts, and varied fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified, with low-quality examples discarded. We evaluate a range of representative open-source and proprietary LMMs under both Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. We hope LogicOCR will serve as a valuable resource for advancing multimodal reasoning research. The dataset is available at https://github.com/MiliLab/LogicOCR.
中文: LogicOCR是一个包含1,100道选择题的新基准,旨在评估大型多模态模型在文本丰富图像上的逻辑推理能力,揭示了尽管OCR技术有所进步,但它们在视觉阅读与推理结合方面仍存在不足。
English: LogicOCR is a new benchmark with 1,100 multiple-choice questions designed to assess large multimodal models' logical reasoning on text-rich images, revealing their limitations in integrating visual reading with reasoning despite advances in OCR.

Authors:Md. Atiqur Rahman, Sabrina Islam, Mushfiqul Haque Omi
Title: LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark
Abstract:
Evaluating machine translation (MT) for low-resource languages poses a persistent challenge, primarily due to the limited availability of high-quality reference translations. This issue is further exacerbated in languages with multiple dialects, where linguistic diversity and data scarcity hinder robust evaluation. Large Language Models (LLMs) present a promising solution through reference-free evaluation techniques; however, their effectiveness diminishes in the absence of dialect-specific context and tailored guidance. In this work, we propose a comprehensive framework that enhances LLM-based MT evaluation using a dialect-guided approach. We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers. To address the vocabulary gap, we augment the tokenizer vocabulary with dialect-specific terms. We further introduce a regression head to enable scalar score prediction and design a dialect-guided (DG) prompting strategy. Our evaluation across multiple LLMs shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation, along with improvements across other evaluation settings. The dataset and code are available at https://github.com/180041123-Atiq/MTEonLowResourceLanguage.
中文: 本研究提出了一种方言引导的框架,通过整合方言特定数据和提示策略,增强了基于大语言模型的低资源语言机器翻译评估,在相关性指标上取得了显著提升。
English: This study introduces a dialect-guided framework that enhances LLM-based machine translation evaluation for low-resource languages by incorporating dialect-specific data and prompting strategies, achieving significant improvements in correlation metrics.

Authors:Quanjiang Guo, Jinchuan Zhang, Sijie Wang, Ling Tian, Zhao Kang, Bin Yan, Weidong Xiao
Title: Bridging Generative and Discriminative Learning: Few-Shot Relation Extraction via Two-Stage Knowledge-Guided Pre-training
Abstract:
Few-Shot Relation Extraction (FSRE) remains a challenging task due to the scarcity of annotated data and the limited generalization capabilities of existing models. Although large language models (LLMs) have demonstrated potential in FSRE through in-context learning (ICL), their general-purpose training objectives often result in suboptimal performance for task-specific relation extraction. To overcome these challenges, we propose TKRE (Two-Stage Knowledge-Guided Pre-training for Relation Extraction), a novel framework that synergistically integrates LLMs with traditional relation extraction models, bridging generative and discriminative learning paradigms. TKRE introduces two key innovations: (1) leveraging LLMs to generate explanation-driven knowledge and schema-constrained synthetic data, addressing the issue of data scarcity; and (2) a two-stage pre-training strategy combining Masked Span Language Modeling (MSLM) and Span-Level Contrastive Learning (SCL) to enhance relational reasoning and generalization. Together, these components enable TKRE to effectively tackle FSRE tasks. Comprehensive experiments on benchmark datasets demonstrate the efficacy of TKRE, achieving new state-of-the-art performance in FSRE and underscoring its potential for broader application in low-resource scenarios. The code and data are released at https://github.com/UESTC-GQJ/TKRE.
中文: 提出的TKRE框架通过解释驱动的知识生成和两阶段预训练策略,将大语言模型与传统关系抽取技术相结合,有效解决数据稀缺问题并提升泛化能力,在小样本关系抽取任务中实现了最先进的性能。
English: The proposed TKRE framework integrates large language models with traditional relation extraction techniques through explanation-driven knowledge generation and a two-stage pre-training strategy, achieving state-of-the-art performance in Few-Shot Relation Extraction by effectively addressing data scarcity and enhancing generalization.

Authors:Shaobo Wang, Xiangqi Jin, Ziming Wang, Jize Wang, Jiajun Zhang, Kaixin Li, Zichen Wen, Zhong Li, Conghui He, Xuming Hu, Linfeng Zhang
Title: Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning
Abstract:
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. As dataset sizes grow, efficiently selecting optimal subsets for training becomes crucial to balancing performance and computational costs. Traditional data selection methods often require fine-tuning a scoring model on the target dataset, which is time-consuming and resource-intensive, or rely on heuristics that fail to fully leverage the model's predictive capabilities. To address these challenges, we propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Comprehensive evaluations were conducted on both raw and synthetic datasets across diverse tasks and models. Notably, Data Whisperer achieves superior performance compared to training on the full GSM8K dataset with the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4× speedup. The code is available at https://github.com/gszfwsb/Data-Whisperer.
Chinese: Data Whisperer 是一种高效、无需训练的基于注意力机制的少样本上下文学习方法,能够为大型语言模型微调选择最优数据子集,以更少的数据和更快的速度超越现有方法的性能。
English: Data Whisperer is an efficient, training-free method that uses attention-based, few-shot in-context learning to select optimal data subsets for fine-tuning LLMs, achieving superior performance with significantly less data and faster speeds than existing approaches.
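
As a deliberately simplified attribution sketch of the selection loop: rate each candidate by the model's few-shot quality when it appears as a demonstration. This omits the paper's attention-based weighting entirely; `icl_score` is a placeholder call to the model to be fine-tuned:

```python
import random

def score_pool(pool, queries, icl_score, k=5, trials=100, seed=0):
    """pool: candidate training samples; queries: held-out task queries;
    icl_score(demos, queries) -> mean quality of few-shot predictions."""
    rng = random.Random(seed)
    total = [0.0] * len(pool)
    count = [0] * len(pool)
    for _ in range(trials):
        idx = rng.sample(range(len(pool)), k)
        quality = icl_score([pool[i] for i in idx], queries)  # no training
        for i in idx:
            total[i] += quality
            count[i] += 1
    ranked = sorted((i for i in range(len(pool)) if count[i]),
                    key=lambda i: total[i] / count[i], reverse=True)
    return ranked   # keep the top fraction for fine-tuning
```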

Authors:Omar Choukrani, Idriss Malek, Daniil Orel, Zhuohan Xie, Zangir Iklassov, Martin Takáč, Salem Lahlou
Title: LLM-BABYBENCH: Understanding and Evaluating Grounded Planning and Reasoning in LLMs
Abstract:
Assessing the capacity of Large Language Models (LLMs) to plan and reason within the constraints of interactive environments is crucial for developing capable AI agents. We introduce LLM-BabyBench, a new benchmark suite designed specifically for this purpose. Built upon a textual adaptation of the procedurally generated BabyAI grid world, this suite evaluates LLMs on three fundamental aspects of grounded intelligence: (1) predicting the consequences of actions on the environment state (Predict task), (2) generating sequences of low-level actions to achieve specified objectives (Plan task), and (3) decomposing high-level instructions into coherent subgoal sequences (Decompose task). We detail the methodology for generating the three corresponding datasets (LLM-BabyBench-Predict, -Plan, -Decompose) by extracting structured information from an expert agent operating within the text-based environment. Furthermore, we provide a standardized evaluation harness and metrics, including environment interaction for validating generated plans, to facilitate reproducible assessment of diverse LLMs. Initial baseline results highlight the challenges posed by these grounded reasoning tasks. The benchmark suite, datasets, data generation code, and evaluation code are made publicly available (GitHub: https://github.com/choukrani/llm-babybench, HuggingFace: https://huggingface.co/datasets/salem-mbzuai/LLM-BabyBench).
中文: 本文提出LLM-BabyBench基准套件,基于文本版BabyAI网格世界评估大语言模型在预测行动结果、规划行动序列和分解指令方面的基础推理能力,并公开了数据集与评估工具。
English: This paper introduces LLM-BabyBench, a benchmark suite built on a text-based BabyAI grid world to evaluate LLMs' grounded reasoning abilities in predicting action consequences, planning action sequences, and decomposing instructions, with datasets and evaluation tools made publicly available.

Authors:Tiannuo Yang, Zebin Yao, Bowen Jin, Lixiao Cui, Yusen Li, Gang Wang, Xiaoguang Liu
Title: Demystifying and Enhancing the Efficiency of Large Language Model Based Search Agents
Abstract:
Large Language Model (LLM)-based search agents have shown remarkable capabilities in solving complex tasks by dynamically decomposing problems and addressing them through interleaved reasoning and retrieval. However, this interleaved paradigm introduces substantial efficiency bottlenecks. First, we observe that both highly accurate and overly approximate retrieval methods degrade system efficiency: exact search incurs significant retrieval overhead, while coarse retrieval requires additional reasoning steps during generation. Second, we identify inefficiencies in system design, including improper scheduling and frequent retrieval stalls, which lead to cascading latency -- where even minor delays in retrieval amplify end-to-end inference time. To address these challenges, we introduce SearchAgent-X, a high-efficiency inference framework for LLM-based search agents. SearchAgent-X leverages high-recall approximate retrieval and incorporates two key techniques: priority-aware scheduling and non-stall retrieval. Extensive experiments demonstrate that SearchAgent-X consistently outperforms state-of-the-art systems such as vLLM and HNSW-based retrieval across diverse tasks, achieving up to 3.4× higher throughput and 5× lower latency, without compromising generation quality. SearchAgent-X is available at https://github.com/tiannuo-yang/SearchAgent-X.
中文: 基于大语言模型的搜索代理因检索方法和系统设计存在效率瓶颈,而SearchAgent-X通过高召回近似检索、优先级调度和无中断检索技术,在不影响生成质量的前提下大幅提升了吞吐量并降低了延迟。
English: LLM-based search agents face efficiency issues from retrieval methods and system design, which SearchAgent-X addresses using high-recall retrieval, priority-aware scheduling, and non-stall retrieval to significantly boost throughput and reduce latency without sacrificing quality.
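
The scheduling side can be illustrated with a small priority queue; the priority function here (serve the request estimated to finish soonest) is an assumed heuristic rather than the paper's exact policy:

```python
import heapq
import itertools

class PriorityScheduler:
    """Serve the request expected to finish soonest, limiting the
    head-of-line blocking that lets retrieval stalls cascade into latency."""
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()   # stable order for equal priorities

    def submit(self, request, est_remaining_steps: int) -> None:
        heapq.heappush(self._heap, (est_remaining_steps, next(self._tie), request))

    def next(self):
        return heapq.heappop(self._heap)[2]
```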

Authors:Peng Ding, Jun Kuang, Zongyu Wang, Xuezhi Cao, Xunliang Cai, Jiajun Chen, Shujian Huang
Title: Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement
Abstract:
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks. In this paper, we identify a critical safety gap: while LLMs are adept at detecting jailbreak prompts, they often produce unsafe responses when directly processing these inputs. Inspired by this insight, we propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability. SAGE consists of two core components: a Discriminative Analysis Module and a Discriminative Response Module, enhancing resilience against sophisticated jailbreak attempts through flexible safety discrimination instructions. Extensive experiments demonstrate SAGE's effectiveness and robustness across various open-source and closed-source LLMs of different sizes and architectures, achieving an average 99% defense success rate against numerous complex and covert jailbreak methods while maintaining helpfulness on general benchmarks. We further conduct mechanistic interpretability analysis through hidden states and attention distributions, revealing the underlying mechanisms of this detection-generation discrepancy. Our work thus contributes to developing future LLMs with coherent safety awareness and generation behavior. Our code and datasets are publicly available at https://github.com/NJUNLP/SAGE.
中文摘要:本文提出无需训练的防御策略SAGE,通过协调大语言模型强大的安全识别能力与相对薄弱的安全生成能力,有效抵御复杂越狱攻击并保持通用帮助性,实现了高达99%的平均防御成功率。
English Summary: The paper introduces SAGE, a training-free defense strategy that enhances LLM safety by aligning their strong jailbreak detection with improved response generation, achieving high defense rates against sophisticated attacks while maintaining general helpfulness.
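
The detection-generation gap suggests a simple two-call pattern, sketched below with assumed prompt wording; `llm` is a placeholder, and the paper's actual modules and instructions are considerably more elaborate:

```python
def sage_style_respond(llm, user_prompt: str) -> str:
    """First exploit the model's strong safety discrimination,
    then condition generation on its own verdict."""
    verdict = llm(
        "Does the following request attempt a jailbreak or ask for unsafe "
        "content? Answer SAFE or UNSAFE only.\n\n" + user_prompt
    )
    if "UNSAFE" in verdict.upper():
        return llm(
            "The request below was judged unsafe. Refuse briefly and explain "
            "why, without fulfilling it.\n\n" + user_prompt
        )
    return llm(user_prompt)
```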

Authors:Xuanle Zhao, Xuexin Liu, Haoyue Yang, Xianzhen Luo, Fanhu Zeng, Jianling Li, Qi Shi, Chi Chen
Title: ChartEdit: How Far Are MLLMs From Automating Chart Analysis? Evaluating MLLMs' Capability via Chart Editing
Abstract:
Although multimodal large language models (MLLMs) show promise in generating chart-rendering code, editing charts via code presents a greater challenge. This task demands that MLLMs integrate chart understanding and reasoning capacities, which are labor-intensive to evaluate. While many MLLMs claim such editing capabilities, current evaluations rely on limited case studies, highlighting the urgent need for a comprehensive evaluation framework. In this work, we propose ChartEdit, a novel benchmark designed for chart editing tasks, featuring 1,405 diverse editing instructions applied to 233 real-world charts, each manually annotated and validated for accuracy. Utilizing ChartEdit, we evaluate the performance of 10 mainstream MLLMs across two types of experiments at both the code and chart levels. The results suggest that large-scale models can generate code to produce images that partially match the reference images. However, their ability to generate accurate edits according to the instructions remains limited. The state-of-the-art (SOTA) model achieves a score of only 59.96, highlighting significant challenges in precise modification. In contrast, small-scale models, including chart-domain models, struggle both with following editing instructions and with generating overall chart images, underscoring the need for further development in this area. Code is available at https://github.com/xxlllz/ChartEdit.
Chinese: 提出的ChartEdit基准测试评估了多模态大语言模型在图表编辑任务中的表现,发现大型模型虽能部分匹配参考图像,但在精确修改方面存在困难,而小型模型在遵循指令和图表生成方面面临更大挑战。
English: The proposed ChartEdit benchmark evaluates multimodal large language models (MLLMs) on chart editing tasks, revealing that while large models can partially match reference images, they struggle with precise modifications, and small models face even greater challenges in instruction-following and chart generation.

Authors:Yuyao Zhang, Zhicheng Dou, Xiaoxi Li, Jiajie Jin, Yongkang Wu, Zhonghua Li, Qi Ye, Ji-Rong Wen
Title: Neuro-Symbolic Query Compiler
Abstract:
Precise recognition of search intent in Retrieval-Augmented Generation (RAG) systems remains a challenging goal, especially under resource constraints and for complex queries with nested structures and dependencies. This paper presents QCompiler, a neuro-symbolic framework inspired by linguistic grammar rules and compiler design, to bridge this gap. It theoretically designs a minimal yet sufficient Backus-Naur Form (BNF) grammar G[q] to formalize complex queries. Unlike previous methods, this grammar maintains completeness while minimizing redundancy. Based on this, QCompiler includes a Query Expression Translator, a Lexical Syntax Parser, and a Recursive Descent Processor to compile queries into Abstract Syntax Trees (ASTs) for execution. The atomicity of the sub-queries in the leaf nodes ensures more precise document retrieval and response generation, significantly improving the RAG system's ability to address complex queries.
中文: 本文提出QCompiler这一神经符号框架,通过最小化的巴科斯范式语法形式化复杂查询,并将其编译为抽象语法树,显著提升了RAG系统处理复杂查询时文档检索与响应生成的精确度。
English: This paper introduces QCompiler, a neuro-symbolic framework that formalizes complex queries using a minimal Backus-Naur Form grammar and compiles them into Abstract Syntax Trees to enhance document retrieval and response generation in RAG systems.
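
To make the compiler analogy concrete, here is a toy recursive-descent sketch in Python that parses a nested query into an AST whose leaves are atomic sub-queries. The two-operator grammar is invented for illustration; the paper's BNF grammar G[q] and its translator, parser, and processor are richer.

# Toy grammar (invented): query := atom | '(' query ('AND' | 'THEN') query ')'
from dataclasses import dataclass

@dataclass
class Node:
    op: str                  # 'ATOM', 'AND' (parallel), or 'THEN' (dependent)
    text: str = ""
    children: tuple = ()

def parse(tokens: list) -> Node:
    def expr(i: int):
        if tokens[i] == "(":
            left, i = expr(i + 1)
            op = tokens[i]                      # 'AND' or 'THEN'
            right, i = expr(i + 1)
            assert tokens[i] == ")"
            return Node(op, children=(left, right)), i + 1
        return Node("ATOM", text=tokens[i]), i + 1   # leaf = atomic sub-query
    node, _ = expr(0)
    return node

# Leaf nodes are atomic sub-queries that can each be sent to the retriever.
ast = parse(["(", "capital of France", "THEN", "its population", ")"])
print(ast.op)   # THEN: the second retrieval depends on the first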

Authors:Yanbo Dai, Zhenlan Ji, Zongjie Li, Shuai Wang
Title: EAMET: Robust Massive Model Editing via Embedding Alignment Optimization
Abstract:
Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics. Their robustness is also limited in context-rich settings or when editing multiple facts of the same subject simultaneously. We attribute these failures to the embedding misalignment among knowledge items, which undermines editing reliability at scale. To address this, we propose EAMET (Embedding Alignment Model Editing in Transformers), which aligns the space of key and residual embeddings. Extensive experiments across six LLMs and three datasets demonstrate that EAMET consistently outperforms existing methods, achieving about 90\% editing efficacy when editing 10k facts. Codes and datasets are publicly available at https://ybdai7.github.io/eamet-page/.
Chinese: 针对大规模编辑场景下模型编辑技术效果下降的问题,提出了EAMET方法,通过对齐嵌入空间显著提升了编辑效能和鲁棒性,在多项实验中实现约90%的编辑成功率。
English: Model editing techniques for updating knowledge in large language models face challenges in massive editing scenarios, leading to the proposed EAMET method that aligns embeddings to enhance efficacy and robustness, achieving about 90% effectiveness in extensive experiments.

Authors:Xuannan Liu, Zekun Li, Zheqi He, Peipei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, Ran He
Title: Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Abstract:
The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
中文摘要:Video-SafetyBench作为首个全面评估大型视觉语言模型在视频文本攻击下安全性的基准,通过2264个测试样本和新颖评估指标,揭示了模型在视频诱导攻击中的显著脆弱性。
English Summary: Video-SafetyBench is introduced as the first comprehensive benchmark to evaluate Large Vision-Language Models' safety against video-text attacks, revealing significant vulnerabilities through 2,264 test pairs and a novel evaluation metric.
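
A minimal sketch of an RJScore-style evaluation in Python, under the assumption that the judge LLM exposes a confidence that a response is harmful; the actual RJScore formula and calibration procedure are defined in the paper.

def calibrate_threshold(scores, human_labels):
    """Pick the cutoff that agrees most with human harmfulness labels."""
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda t: sum((s >= t) == y
                                 for s, y in zip(scores, human_labels)))

# Judge confidences on a small human-labeled calibration set (toy numbers).
cal_scores = [0.91, 0.40, 0.78, 0.12, 0.66]
cal_labels = [True, False, True, False, True]
tau = calibrate_threshold(cal_scores, cal_labels)

# Attack success rate of a method = fraction of responses judged harmful.
test_scores = [0.83, 0.55, 0.20, 0.95]
asr = sum(s >= tau for s in test_scores) / len(test_scores)
print(f"threshold={tau}, ASR={asr:.1%}")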

Authors:Hongliang Li, Jinan Xu, Gengping Cui, Changhao Guan, Fengran Mo, Kaiyu Huang
Title: Multilingual Collaborative Defense for Large Language Models
Abstract:
The robustness and security of large language models (LLMs) has become a prominent research area. One notable vulnerability is the ability to bypass LLM safeguards by translating harmful queries into rare or underrepresented languages, a simple yet effective method of "jailbreaking" these models. Despite the growing concern, there has been limited research addressing the safeguarding of LLMs in multilingual scenarios, highlighting an urgent need to enhance multilingual safety. In this work, we investigate the correlation between various attack features across different languages and propose Multilingual Collaborative Defense (MCD), a novel learning method that optimizes a continuous, soft safety prompt automatically to facilitate multilingual safeguarding of LLMs. The MCD approach offers three advantages: First, it effectively improves safeguarding performance across multiple languages. Second, MCD maintains strong generalization capabilities while minimizing false refusal rates. Third, MCD mitigates the language safety misalignment caused by imbalances in LLM training corpora. To evaluate the effectiveness of MCD, we manually construct multilingual versions of commonly used jailbreak benchmarks, such as MaliciousInstruct and AdvBench, to assess various safeguarding methods. Additionally, we introduce these datasets in underrepresented (zero-shot) languages to verify the language transferability of MCD. The results demonstrate that MCD outperforms existing approaches in safeguarding against multilingual jailbreak attempts while also exhibiting strong language transfer capabilities. Our code is available at https://github.com/HLiang-Lee/MCD.
中文摘要:本文提出多语言协同防御(MCD)方法,通过自动优化安全提示来增强大语言模型的多语言防护能力,实验证明该方法在抵御跨语言越狱攻击方面优于现有方案,并展现出优秀的语言迁移性能。
English Summary: This paper introduces Multilingual Collaborative Defense (MCD), a novel method that automatically optimizes safety prompts to protect large language models from multilingual jailbreak attacks, demonstrating superior performance and transferability across languages compared to existing approaches.
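
A schematic PyTorch sketch of the underlying mechanism: a single continuous soft safety prompt is prepended to the input embeddings and optimized over examples from several languages while the LLM stays frozen. The model stub, data generator, and loss below are illustrative assumptions, not the MCD objective itself.

import torch

torch.manual_seed(0)
d_model, n_soft = 64, 8                       # toy sizes
soft_prompt = torch.randn(n_soft, d_model, requires_grad=True)
opt = torch.optim.Adam([soft_prompt], lr=1e-3)

def frozen_lm_refusal_logit(embeds):
    """Stand-in for the frozen LLM's logit for refusing the request."""
    return embeds.mean()                      # placeholder computation

def multilingual_batches():
    # Toy stand-in: (token-embedding matrix, refusal target) pairs that in
    # practice would come from parallel jailbreak benchmarks per language.
    for _ in range(4):
        yield torch.randn(10, d_model), torch.randint(0, 2, ()).float()

for _ in range(3):
    loss = 0.0
    for x, should_refuse in multilingual_batches():
        embeds = torch.cat([soft_prompt, x], dim=0)
        logit = frozen_lm_refusal_logit(embeds)
        # Refuse harmful queries in every language; answer benign ones,
        # keeping false refusals low.
        loss = loss + torch.nn.functional.binary_cross_entropy_with_logits(
            logit, should_refuse
        )
    opt.zero_grad(); loss.backward(); opt.step()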

Authors:Yansong Ning, Wei Li, Jun Fang, Naiqiang Tan, Hao Liu
Title: Not All Thoughts are Generated Equal: Efficient LLM Reasoning via Multi-Turn Reinforcement Learning
Abstract:
Compressing long chain-of-thought (CoT) from large language models (LLMs) is an emerging strategy to improve the reasoning efficiency of LLMs. Despite its promising benefits, existing studies equally compress all thoughts within a long CoT, hindering more concise and effective reasoning. To this end, we first investigate the importance of different thoughts by examining their effectiveness and efficiency in contributing to reasoning through automatic long CoT chunking and Monte Carlo rollouts. Building upon the insights, we propose a theoretically bounded metric to jointly measure the effectiveness and efficiency of different thoughts. We then propose Long$\otimes$Short, an efficient reasoning framework in which two LLMs collaboratively solve the problem: a long-thought LLM that more effectively generates important thoughts, and a short-thought LLM that efficiently generates the remaining thoughts. Specifically, we begin by synthesizing a small amount of cold-start data to fine-tune LLMs for long-thought and short-thought reasoning styles, respectively. Furthermore, we propose a synergizing-oriented multi-turn reinforcement learning approach, focusing on model self-evolution and collaboration between the long-thought and short-thought LLMs. Experimental results show that our method enables Qwen2.5-7B and Llama3.1-8B to achieve performance comparable to DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B, while reducing token length by over 80% across the MATH500, AIME24/25, AMC23, and GPQA Diamond benchmarks. Our data and code are available at https://github.com/usail-hkust/LongShort.
中文:本文提出Long⊗Short协同推理框架,通过长思考与短思考模型分别生成关键思路和辅助思路,在多个基准测试中实现与现有模型相当性能的同时,将推理长度减少80%以上。
English: This paper introduces Long⊗Short, a collaborative reasoning framework where two LLMs generate important and remaining thoughts respectively, achieving comparable performance to existing models while reducing token usage by over 80% across multiple benchmarks.

Authors:Kaitao Song, Xiaohua Wang, Xu Tan, Huiqiang Jiang, Chengruidong Zhang, Yongliang Shen, Cen LU, Zihao Li, Zifan Song, Caihua Shan, Yansen Wang, Kan Ren, Xiaoqing Zheng, Tao Qin, Yuqing Yang, Dongsheng Li, Lili Qiu
Title: Chain-of-Model Learning for Language Model
Abstract:
In this paper, we propose a novel learning paradigm, termed Chain-of-Model (CoM), which incorporates the causal relationship into the hidden states of each layer in a chain style, thereby introducing great scaling efficiency in model training and inference flexibility in deployment. We introduce the concept of Chain-of-Representation (CoR), which formulates the hidden states at each layer as a combination of multiple sub-representations (i.e., chains) at the hidden dimension level. In each layer, each chain of the output representations can only view all of its preceding chains in the input representations. Consequently, a model built upon the CoM framework can progressively scale up the model size by adding chains on top of previous models (i.e., chains), and offer multiple sub-models at varying sizes for elastic inference by using different chain numbers. Based on this principle, we devise Chain-of-Language-Model (CoLM), which incorporates the idea of CoM into each layer of the Transformer architecture. Based on CoLM, we further introduce CoLM-Air with a KV sharing mechanism that computes all keys and values within the first chain and then shares them across all chains. This design demonstrates additional extensibility, such as enabling seamless LM switching, prefilling acceleration, and so on. Experimental results demonstrate that our CoLM family achieves performance comparable to the standard Transformer while simultaneously enabling greater flexibility, such as progressive scaling to improve training efficiency and multiple model sizes for elastic inference, paving a new way toward building language models. Our code will be released in the future at: https://github.com/microsoft/CoLM.
中文: 本文提出链式模型(CoM)这一新型学习范式,通过将隐藏状态构建为因果链实现渐进式模型扩展和弹性推理,在保持与标准Transformer相当性能的同时显著提升了训练效率和部署灵活性。
English: This paper introduces Chain-of-Model (CoM), a novel learning paradigm that enhances training efficiency and inference flexibility by structuring hidden states into causal chains, enabling progressive model scaling and elastic deployment.
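
A toy PyTorch sketch of the Chain-of-Representation constraint: the hidden dimension is split into chains, and a linear layer is masked so that output chain i reads only input chains 0..i, which is what lets a prefix of chains act as a standalone smaller sub-model. This is an illustrative reconstruction, not the CoLM code.

import torch
import torch.nn as nn

class ChainLinear(nn.Module):
    def __init__(self, d_model: int, n_chains: int):
        super().__init__()
        assert d_model % n_chains == 0
        self.linear = nn.Linear(d_model, d_model)
        c = d_model // n_chains
        # Block lower-triangular mask over (out_chain, in_chain) blocks.
        mask = torch.zeros(d_model, d_model)
        for i in range(n_chains):
            mask[i * c:(i + 1) * c, : (i + 1) * c] = 1.0
        self.register_buffer("mask", mask)

    def forward(self, x):
        return nn.functional.linear(
            x, self.linear.weight * self.mask, self.linear.bias
        )

layer = ChainLinear(d_model=8, n_chains=2)
y = layer(torch.randn(1, 8))
# The first chain (first 4 output dims) never depends on the second chain's
# inputs, so keeping only chain 0 yields a valid smaller model.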

Authors:Yang Tan, Wenrui Gou, Bozitao Zhong, Liang Hong, Huiqun Yu, Bingxin Zhou
Title: VenusX: Unlocking Fine-Grained Functional Understanding of Proteins
Abstract:
Deep learning models have driven significant progress in predicting protein function and interactions at the protein level. While these advancements have been invaluable for many biological applications such as enzyme engineering and function annotation, a more detailed perspective is essential for understanding protein functional mechanisms and evaluating the biological knowledge captured by models. To address this demand, we introduce VenusX, the first large-scale benchmark for fine-grained functional annotation and function-based protein pairing at the residue, fragment, and domain levels. VenusX comprises three major task categories across six types of annotations, including residue-level binary classification, fragment-level multi-class classification, and pairwise functional similarity scoring for identifying critical active sites, binding sites, conserved sites, motifs, domains, and epitopes. The benchmark features over 878,000 samples curated from major open-source databases such as InterPro, BioLiP, and SAbDab. By providing mixed-family and cross-family splits at three sequence identity thresholds, our benchmark enables a comprehensive assessment of model performance on both in-distribution and out-of-distribution scenarios. For baseline evaluation, we assess a diverse set of popular and open-source models, including pre-trained protein language models, sequence-structure hybrids, structure-based methods, and alignment-based techniques. Their performance is reported across all benchmark datasets and evaluation settings using multiple metrics, offering a thorough comparison and a strong foundation for future research. Code and data are publicly available at https://github.com/ai4protein/VenusX.
中文: VenusX作为首个在残基、片段和结构域层面进行精细蛋白质功能注释和配对的大规模基准,通过涵盖多种任务和场景,为模型评估提供了全面支持。
English: VenusX is introduced as the first large-scale benchmark for fine-grained protein functional annotation and pairing at residue, fragment, and domain levels, enabling comprehensive evaluation of models across diverse tasks and scenarios.

Authors:Raymond Baartmans, Matthew Raffel, Rahul Vikram, Aiden Deringer, Lizhong Chen
Title: Towards Universal Semantics With Large Language Models
Abstract:
The Natural Semantic Metalanguage (NSM) is a linguistic theory based on a universal set of semantic primes: simple, primitive word-meanings that have been shown to exist in most, if not all, languages of the world. According to this framework, any word, regardless of complexity, can be paraphrased using these primes, revealing a clear and universally translatable meaning. These paraphrases, known as explications, can offer valuable applications for many natural language processing (NLP) tasks, but producing them has traditionally been a slow, manual process. In this work, we present the first study of using large language models (LLMs) to generate NSM explications. We introduce automatic evaluation methods, a tailored dataset for training and evaluation, and fine-tuned models for this task. Our 1B and 8B models outperform GPT-4o in producing accurate, cross-translatable explications, marking a significant step toward universal semantic representation with LLMs and opening up new possibilities for applications in semantic analysis, translation, and beyond. Our code is available at https://github.com/OSU-STARLAB/DeepNSM.
中文: 本研究首次利用大语言模型自动生成自然语义元语言释义,优化后的模型在准确性和跨语言可译性上超越GPT-4o,为自然语言处理的语义表征开辟了新途径。
English: This study pioneers the use of large language models to automatically generate Natural Semantic Metalanguage explications, with fine-tuned models outperforming GPT-4o in accuracy and cross-translatability, advancing universal semantic representation for NLP applications.

Authors:Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
Title: Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation
Abstract:
Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant information but also employing multi-hop reasoning across the information sources. Although LMs perform well on traditional question-answering tasks, the causal mask can hinder their capacity to reason across complex contexts. In this paper, we explore how LMs respond to multi-hop questions by permuting search results (retrieved documents) under various configurations. Our study reveals interesting findings as follows: 1) Encoder-decoder models, such as the ones in the Flan-T5 family, generally outperform causal decoder-only LMs in MHQA tasks, despite being significantly smaller in size; 2) altering the order of gold documents reveals distinct trends in both Flan-T5 models and fine-tuned decoder-only models, with optimal performance observed when the document order aligns with the reasoning chain order; 3) enhancing causal decoder-only models with bi-directional attention by modifying the causal mask can effectively boost their end performance. In addition to the above, we conduct a thorough investigation of the distribution of LM attention weights in the context of MHQA. Our experiments reveal that attention weights tend to peak at higher values when the resulting answer is correct. We leverage this finding to heuristically improve LMs' performance on this task. Our code is publicly available at https://github.com/hwy9855/MultiHopQA-Reasoning.
中文: 多跳问答因因果掩码对语言模型构成挑战,但Flan-T5等编码器-解码器模型优于较小的仅解码器模型,当文档顺序与推理链一致且注意力权重在正确答案时达到峰值时,性能显著提升。
English: Multi-hop question answering presents challenges for language models due to causal masks, but encoder-decoder models like Flan-T5 outperform smaller decoder-only models, with performance improving when document order matches reasoning chains and attention weights peak for correct answers.
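
A small sketch of the third finding's mask modification: context (document) tokens attend bidirectionally among themselves while generated tokens stay causal. The boolean convention (True = attention allowed) is an assumption for illustration.

import torch

def hybrid_attention_mask(n_context: int, n_total: int) -> torch.Tensor:
    allowed = torch.tril(torch.ones(n_total, n_total, dtype=torch.bool))
    # Lift the causal constraint inside the context block only.
    allowed[:n_context, :n_context] = True
    return allowed

mask = hybrid_attention_mask(n_context=5, n_total=8)
# Row i, column j answers: may position i attend to position j?
print(mask[0, 4])  # True: an early context token now sees later context
print(mask[5, 7])  # False: generation positions remain causal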

Authors:Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou
Title: MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports
Abstract:
Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.
中文:MedCaseReasoning数据集旨在评估大语言模型与临床医生诊断推理的一致性,发现现有模型存在明显不足,但通过基于临床推理的微调可显著提升诊断准确性和推理还原能力。
English: The MedCaseReasoning dataset is introduced to evaluate LLMs' alignment with clinician diagnostic reasoning, revealing significant gaps in current models' performance but showing substantial improvement through fine-tuning on clinical reasoning traces.

Authors:Shun Inadumi, Nobuhiro Ueda, Koichiro Yoshino
Title: Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures
Abstract:
Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with coreference resolution performs better in pronoun phrase grounding than representative models for this task, MDETR and GLIP. Our qualitative analysis demonstrates that incorporating textual reference relations strengthens the confidence scores between mentions, including pronouns and predicates, and objects, which can reduce the ambiguities that arise in visually grounded dialogues.
中文摘要:本文提出一个统一文本和多模态指代消解的框架,通过将指称映射到对象并利用相似性来增强短语定位,实验表明该框架在代词消解等任务上优于MDETR和GLIP等模型,有效减少视觉对话中的歧义。
English Summary: This paper introduces a unified framework for textual and multimodal reference resolution that enhances phrase grounding by mapping mentions to objects and leveraging similarities, with experiments showing improved performance, especially in pronoun resolution, over models like MDETR and GLIP.

Authors:Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Title: SoftCoT++: Test-Time Scaling with Soft Chain-of-Thought Reasoning
Abstract:
Test-Time Scaling (TTS) refers to approaches that improve reasoning performance by allocating extra computation during inference, without altering the model's parameters. While existing TTS methods operate in a discrete token space by generating more intermediate steps, recent studies in Coconut and SoftCoT have demonstrated that thinking in the continuous latent space can further enhance the reasoning performance. Such latent thoughts encode informative thinking without the information loss associated with autoregressive token generation, sparking increased interest in continuous-space reasoning. Unlike discrete decoding, where repeated sampling enables exploring diverse reasoning paths, latent representations in continuous space are fixed for a given input, which limits diverse exploration, as all decoded paths originate from the same latent thought. To overcome this limitation, we introduce SoftCoT++ to extend SoftCoT to the Test-Time Scaling paradigm by enabling diverse exploration of thinking paths. Specifically, we perturb latent thoughts via multiple specialized initial tokens and apply contrastive learning to promote diversity among soft thought representations. Experiments across five reasoning benchmarks and two distinct LLM architectures demonstrate that SoftCoT++ significantly boosts SoftCoT and also outperforms SoftCoT with self-consistency scaling. Moreover, it shows strong compatibility with conventional scaling techniques such as self-consistency. Source code is available at https://github.com/xuyige/SoftCoT.
中文:SoftCoT++ 通过扰动和对比学习实现潜在思维路径的多样化探索,在不改变模型参数的情况下,显著提升了多个推理基准的性能并优于现有方法。
English: SoftCoT++ enhances reasoning by diversifying latent thought exploration through perturbations and contrastive learning, outperforming existing methods across multiple benchmarks without altering model parameters.
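
A schematic PyTorch sketch of the two ingredients: several specialized initial tokens seed multiple latent thoughts per input, and a contrastive-style repulsion term keeps those thoughts diverse. The shapes and the loss form are illustrative; the paper defines the actual architecture and contrastive objective.

import torch
import torch.nn.functional as F

def diversity_loss(thoughts):
    """thoughts: (k, d) soft thought vectors for one input."""
    z = F.normalize(thoughts, dim=-1)
    sim = z @ z.T                                    # pairwise cosine sims
    off_diag = sim - torch.eye(len(z))               # ignore self-similarity
    return off_diag.abs().mean()                     # push thoughts apart

k, d = 4, 16
init_tokens = torch.nn.Parameter(torch.randn(k, d))  # specialized init tokens
# Each of the k initial tokens seeds one latent thought, and each thought
# seeds an independent decoded reasoning path at test time.
thoughts = init_tokens + 0.1 * torch.randn(k, d)     # stand-in for the model
print(diversity_loss(thoughts))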

Authors:Yiming Lei, Chenkai Zhang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Title: GODBench: A Benchmark for Multimodal Large Language Models in Video Comment Art
Abstract:
Video Comment Art enhances user engagement by providing creative content that conveys humor, satire, or emotional resonance, requiring a nuanced and comprehensive grasp of cultural and contextual subtleties. Although Multimodal Large Language Models (MLLMs) and Chain-of-Thought (CoT) have demonstrated strong reasoning abilities in STEM tasks (e.g. mathematics and coding), they still struggle to generate creative expressions such as resonant jokes and insightful satire. Moreover, existing benchmarks are constrained by their limited modalities and insufficient categories, hindering the exploration of comprehensive creativity in video-based Comment Art creation. To address these limitations, we introduce GODBench, a novel benchmark that integrates video and text modalities to systematically evaluate MLLMs' abilities to compose Comment Art. Furthermore, inspired by the propagation patterns of waves in physics, we propose Ripple of Thought (RoT), a multi-step reasoning framework designed to enhance the creativity of MLLMs. Extensive experiments reveal that existing MLLMs and CoT methods still face significant challenges in understanding and generating creative video comments. In contrast, RoT provides an effective approach to improve creative composing, highlighting its potential to drive meaningful advancements in MLLM-based creativity. GODBench is publicly available at https://github.com/stan-lei/GODBench-ACL2025.
中文: GODBench是一个结合视频与文本的新基准,用于评估多模态大语言模型创作视频弹幕的能力,而提出的涟漪思维框架通过改进幽默与讽刺生成的现有局限,显著提升了模型的创意表达水平。
English: GODBench is a new benchmark combining video and text to evaluate multimodal large language models' ability to create engaging video comments, while the proposed Ripple of Thought framework enhances their creative expression by addressing current limitations in humor and satire generation.

Authors:Bohao Xing, Xin Liu, Guoying Zhao, Chengyu Liu, Xiaolan Fu, Heikki Kälviäinen
Title: EmotionHallucer: Evaluating Emotion Hallucinations in Multimodal Large Language Models
Abstract:
Emotion understanding is a critical yet challenging task. Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities in this area. However, MLLMs often suffer from hallucinations, generating irrelevant or nonsensical content. To the best of our knowledge, despite the importance of this issue, there has been no dedicated effort to evaluate emotion-related hallucinations in MLLMs. In this work, we introduce EmotionHallucer, the first benchmark for detecting and analyzing emotion hallucinations in MLLMs. Unlike humans, whose emotion understanding stems from the interplay of biology and social learning, MLLMs rely solely on data-driven learning and lack innate emotional instincts. Fortunately, emotion psychology provides a solid foundation of knowledge about human emotions. Building on this, we assess emotion hallucinations from two dimensions: emotion psychology knowledge and real-world multimodal perception. To support robust evaluation, we utilize an adversarial binary question-answer (QA) framework, which employs carefully crafted basic and hallucinated pairs to assess the emotion hallucination tendencies of MLLMs. By evaluating 38 LLMs and MLLMs on EmotionHallucer, we reveal that: i) most current models exhibit substantial issues with emotion hallucinations; ii) closed-source models outperform open-source ones in detecting emotion hallucinations, and reasoning capability provides additional advantages; iii) existing models perform better in emotion psychology knowledge than in multimodal emotion perception. As a byproduct, these findings inspire us to propose the PEP-MEK framework, which yields an average improvement of 9.90% in emotion hallucination detection across selected models. Resources will be available at https://github.com/xxtars/EmotionHallucer.
中文: 本文提出了首个检测多模态大语言模型中情绪幻觉的基准EmotionHallucer,揭示了当前模型普遍存在该问题,并提出的框架使检测性能平均提升9.90%。
English: This paper introduces EmotionHallucer, the first benchmark for detecting emotion hallucinations in Multimodal Large Language Models, revealing widespread issues and proposing a framework that improves detection by 9.90%.

Authors:Yapei Chang, Yekyung Kim, Michael Krumdick, Amir Zadeh, Chuan Li, Chris Tanner, Mohit Iyyer
Title: BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Abstract:
Reward models are central to aligning LLMs with human preferences, but they are costly to train, requiring large-scale human-labeled preference data and powerful pretrained LLM backbones. Meanwhile, the increasing availability of high-quality synthetic instruction-following datasets raises the question: can simpler, reference-based metrics serve as viable alternatives to reward models during RL-based alignment? In this paper, we show first that BLEU, a basic string-matching metric, surprisingly matches strong reward models in agreement with human preferences on general instruction-following datasets. Based on this insight, we develop BLEUBERI, a method that first identifies challenging instructions and then applies Group Relative Policy Optimization (GRPO) using BLEU directly as the reward function. We demonstrate that BLEUBERI-trained models are competitive with models trained via reward model-guided RL across four challenging instruction-following benchmarks and three different base language models. A human evaluation further supports that the quality of BLEUBERI model outputs is on par with those from reward model-aligned models. Moreover, BLEUBERI models generate outputs that are more factually grounded than competing methods. Overall, we show that given access to high-quality reference outputs (easily obtained via existing instruction-following datasets or synthetic data generation), string matching-based metrics are cheap yet effective proxies for reward models during alignment. We release our code and data at https://github.com/lilakk/BLEUBERI.
Chinese: 研究表明,简单的字符串匹配指标如BLEU可有效替代成本高昂的奖励模型来对齐语言模型与人类偏好,提出的BLEUBERI方法直接使用BLEU作为奖励函数,在指令遵循任务中实现与奖励模型指导的强化学习相竞争的性能,同时增强生成内容的事实依据性。
English: The study reveals that simple string-matching metrics like BLEU can effectively replace costly reward models for aligning language models with human preferences, introducing BLEUBERI, a method that leverages BLEU as a reward function to achieve competitive performance in instruction-following tasks while enhancing factual grounding.
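
A minimal sketch of the reward computation, assuming the sacrebleu package: each sampled completion is scored with BLEU against the reference, and the group-normalized scores serve as GRPO rewards (the policy update itself is omitted).

from statistics import mean, pstdev
from sacrebleu import sentence_bleu

def bleu_rewards(completions, reference):
    raw = [sentence_bleu(c, [reference]).score for c in completions]
    # GRPO uses group-relative advantages: normalize within the group.
    mu, sigma = mean(raw), pstdev(raw) or 1.0
    return [(r - mu) / sigma for r in raw]

group = [
    "The capital of France is Paris.",
    "Paris is the capital city of France.",
    "I cannot answer that.",
]
print(bleu_rewards(group, "The capital of France is Paris."))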

Authors:Hao Gu, Jiangyan Yi, Chenglong Wang, Jianhua Tao, Zheng Lian, Jiayi He, Yong Ren, Yujie Chen, Zhengqi Wen
Title: ALLM4ADD: Unlocking the Capabilities of Audio Large Language Models for Audio Deepfake Detection
Abstract:
Audio deepfake detection (ADD) has grown increasingly important due to the rise of high-fidelity audio generative models and their potential for misuse. Given that audio large language models (ALLMs) have made significant progress in various audio processing tasks, a natural question arises: \textit{Can ALLMs be leveraged to solve ADD?} In this paper, we first conduct a comprehensive zero-shot evaluation of ALLMs on ADD, revealing their ineffectiveness. To this end, we propose ALLM4ADD, an ALLM-driven framework for ADD. Specifically, we reformulate the ADD task as an audio question answering problem, prompting the model with the question: ``Is this audio fake or real?''. We then perform supervised fine-tuning to enable the ALLM to assess the authenticity of query audio. Extensive experiments are conducted to demonstrate that our ALLM-based method can achieve superior performance in fake audio detection, particularly in data-scarce scenarios. As a pioneering study, we anticipate that this work will inspire the research community to leverage ALLMs to develop more effective ADD systems. Code is available at https://github.com/ucas-hao/qwen_audio_for_add.git
中文摘要:本文提出ALLM4ADD框架,将音频深度伪造检测重新定义为音频问答任务,通过对音频大语言模型进行监督微调,实现了尤其在数据稀缺场景下更优的伪造音频检测性能。
English Summary: This paper introduces ALLM4ADD, a framework that reformulates audio deepfake detection as an audio question answering task and uses supervised fine-tuning of audio large language models to achieve superior performance, especially in data-scarce scenarios.

Authors:Yexiang Liu, Zekun Li, Zhi Fang, Nan Xu, Ran He, Tieniu Tan
Title: Rethinking the Role of Prompting Strategies in LLM Test-Time Scaling: A Perspective of Probability Theory
Abstract:
Recently, scaling test-time compute on Large Language Models (LLMs) has garnered wide attention. However, there has been limited investigation of how various reasoning prompting strategies perform as test-time compute scales. In this paper, we focus on a standard and realistic scaling setting: majority voting. We systematically conduct experiments on 6 LLMs $\times$ 8 prompting strategies $\times$ 6 benchmarks. Experimental results consistently show that as sampling times and computational overhead increase, complicated prompting strategies with superior initial performance gradually fall behind simple Chain-of-Thought. We analyze this phenomenon and provide theoretical proofs. Additionally, we propose a probabilistic method to efficiently predict scaling performance and identify the best prompting strategy under large sampling times, eliminating the need for resource-intensive inference processes in practical applications. Furthermore, we introduce two ways derived from our theoretical analysis to significantly improve the scaling performance. We hope that our research can prompt a re-examination of the role of complicated prompting, unleash the potential of simple prompting strategies, and provide new insights for enhancing test-time scaling performance. Code is available at https://github.com/MraDonkey/rethinking_prompting.
Chinese: 本研究发现在测试时计算扩展过程中,随着资源增加,复杂提示策略会逐渐被简单的思维链方法超越,并提出无需大量推理即可预测最优策略并提升扩展性能的高效方法。
English: This study reveals that as computational resources increase during test-time scaling, complex prompting strategies are outperformed by simple Chain-of-Thought, and proposes efficient methods to predict optimal strategies and enhance scaling performance without extensive inference.
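
The core probabilistic argument can be illustrated in a few lines: under majority voting, n samples follow a multinomial over answers, so scaling behavior can be simulated from single-sample answer distributions instead of running inference. The two distributions below are invented to show how a strategy with higher one-shot accuracy but concentrated errors falls behind at scale.

import random
from collections import Counter

def majority_vote_accuracy(answer_dist, n_samples, trials=20000):
    answers, weights = zip(*answer_dist.items())
    correct = 0
    for _ in range(trials):
        votes = random.choices(answers, weights=weights, k=n_samples)
        if Counter(votes).most_common(1)[0][0] == "correct":
            correct += 1
    return correct / trials

# A "complicated" strategy: higher one-shot accuracy, but its errors
# concentrate on a single distractor answer.
fancy = {"correct": 0.45, "distractor": 0.47, "other": 0.08}
# Simple CoT: lower one-shot accuracy, but errors are spread out.
cot = {"correct": 0.42, "distractor": 0.30, "other": 0.28}

for n in (1, 16, 64):
    print(n, majority_vote_accuracy(fancy, n), majority_vote_accuracy(cot, n))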

Authors:Mohammadtaha Bagherifard, Sahar Rajabi, Ali Edalat, Yadollah Yaghoobzadeh
Title: GenKnowSub: Improving Modularity and Reusability of LLMs through General Knowledge Subtraction
Abstract:
Large language models often struggle with zero-shot generalization, and several modular approaches have been proposed to address this challenge. Yet, we hypothesize that a key limitation remains: the entanglement of general knowledge and task-specific adaptations. To overcome this, we propose a modular framework that disentangles these components by constructing a library of task-specific LoRA modules alongside a general-domain LoRA. By subtracting this general knowledge component from each task-specific module, we obtain residual modules that focus more exclusively on task-relevant information, a method we call general knowledge subtraction (GenKnowSub). Leveraging the refined task-specific modules and the Arrow routing algorithm \citep{ostapenko2024towards}, we dynamically select and combine modules for new inputs without additional training. Our studies on the Phi-3 model and standard Arrow as baselines reveal that using general knowledge LoRAs derived from diverse languages, including English, French, and German, yields consistent performance gains in both monolingual and cross-lingual settings across a wide set of benchmarks. Further experiments on Phi-2 demonstrate how GenKnowSub generalizes to weaker LLMs. The complete code and data are available at https://github.com/saharsamr/Modular-LLM.
中文: 本文提出GenKnowSub模块化框架,通过从任务特定模块中减去通用知识LoRA来解耦通用知识与任务适配,无需重新训练即可动态组合模块,在多语言和跨语言场景中显著提升零样本泛化性能。
English: This paper introduces GenKnowSub, a modular framework that disentangles general knowledge from task-specific adaptations by subtracting a general-domain LoRA from task-specific modules, enabling dynamic combination for improved zero-shot generalization across languages and benchmarks without retraining.
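
A minimal sketch of the subtraction step, representing each LoRA module by its effective weight delta; the actual method operates on LoRA factors and also covers Arrow routing, and the state-dict keys and alpha here are illustrative.

import torch

def genknowsub(task_lora: dict, general_lora: dict, alpha: float = 1.0) -> dict:
    """Subtract the general LoRA's weight delta from the task LoRA's delta."""
    residual = {}
    for key in task_lora:
        # Each entry stores the effective delta W = B @ A for that layer.
        residual[key] = task_lora[key] - alpha * general_lora[key]
    return residual

task = {"layer0.delta": torch.randn(4, 4)}
general = {"layer0.delta": torch.randn(4, 4)}
residual = genknowsub(task, general)   # focuses on task-specific information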

Authors:Songjun Tu, Jiahao Lin, Qichao Zhang, Xiangyu Tian, Linjing Li, Xiangyuan Lan, Dongbin Zhao
Title: Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL
Abstract:
Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs. Project Page: https://github.com/ScienceOne-AI/AutoThink.
Chinese: AutoThink是一种强化学习框架,使大型推理模型能够根据问题复杂度动态选择是否进行显式推理,在提升准确率的同时大幅降低计算消耗。
English: AutoThink is a reinforcement learning framework that enables large reasoning models to dynamically decide when to engage in explicit reasoning, achieving improved accuracy with significantly reduced computational overhead.

Authors:Weiqin Wang, Yile Wang, Hui Huang
Title: Ranked Voting based Self-Consistency of Large Language Models
Abstract:
Majority voting is considered an effective method to enhance chain-of-thought reasoning, as it selects the answer with the highest "self-consistency" among different reasoning paths (Wang et al., 2023). However, previous chain-of-thought reasoning methods typically generate only a single answer in each trial, thereby ignoring the possibility of other potential answers. As a result, these alternative answers are often overlooked in subsequent voting processes. In this work, we propose to generate ranked answers in each reasoning process and conduct ranked voting among multiple ranked answers from different responses, thereby making the overall self-consistency more reliable. Specifically, we use three ranked voting methods: Instant-runoff voting, Borda count voting, and mean reciprocal rank voting. We validate our methods on six datasets, including three multiple-choice and three open-ended question-answering tasks, using both advanced open-source and closed-source large language models. Extensive experimental results indicate that our proposed method outperforms the baselines, showcasing the potential of leveraging the information of ranked answers and using ranked voting to improve reasoning performance. The code is available at https://github.com/szu-tera/RankedVotingSC.
中文: 本研究提出在每次推理过程中生成排序答案并采用排序投票方法,有效提升了思维链推理的可靠性,在多个数据集上的实验结果表明该方法优于现有基准。
English: The proposed method enhances chain-of-thought reasoning by generating ranked answers in each trial and applying ranked voting techniques, which significantly improves reasoning reliability and outperforms existing baselines across multiple datasets.
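
As a concrete instance of one of the three rules, a compact Borda count sketch over ranked answer lists; note how rank information breaks a tie that plain top-1 majority voting cannot resolve.

from collections import defaultdict

def borda_vote(ranked_answers):
    scores = defaultdict(float)
    for ranking in ranked_answers:
        k = len(ranking)
        for pos, ans in enumerate(ranking):
            scores[ans] += k - pos        # top-ranked answer gets k points
    return max(scores, key=scores.get)

# Four reasoning paths, each emitting a ranked list instead of one answer.
# Top-1 majority voting ties (B, B, A, A); Borda count uses the lower ranks
# to pick A (10 points) over B (9 points).
paths = [["B", "A", "C"], ["B", "A", "C"], ["A", "B", "C"], ["A", "C", "B"]]
print(borda_vote(paths))  # A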

Authors:Jen-tse Huang, Kaiser Sun, Wenxuan Wang, Mark Dredze
Title: Language Models Do Not Have Human-Like Working Memory
Abstract:
While Large Language Models (LLMs) exhibit remarkable reasoning abilities, we demonstrate that they lack a fundamental aspect of human cognition: working memory. Human working memory is an active cognitive system that enables not only the temporary storage of information but also its processing and utilization, enabling coherent reasoning and decision-making. Without working memory, individuals may produce unrealistic responses, exhibit self-contradictions, and struggle with tasks that require mental reasoning. Existing evaluations using N-back or context-dependent tasks fall short as they allow LLMs to exploit external context rather than retaining the reasoning process in the latent space. We introduce three novel tasks: (1) Number Guessing, (2) Yes-No Deduction, and (3) Math Magic, designed to isolate internal representation from external context. Across seventeen frontier models spanning four major model families, we consistently observe irrational or contradictory behaviors, indicating LLMs' inability to retain and manipulate latent information. Our work establishes a new benchmark for evaluating working memory in LLMs and highlights this limitation as a key bottleneck for advancing reliable reasoning systems. Code and prompts for the experiments are available at https://github.com/penguinnnnn/LLM-Working-Memory.
中文: 该研究揭示大语言模型缺乏人类工作记忆能力,导致推理任务中出现非理性行为,并提出了新的评估基准以检验这一关键认知缺陷。
English: The study reveals that large language models lack human-like working memory, leading to irrational behaviors in reasoning tasks, and introduces new benchmarks to evaluate this critical cognitive limitation.

Authors:Ke Wang, Junting Pan, Linda Wei, Aojun Zhou, Weikang Shi, Zimu Lu, Han Xiao, Yunqiao Yang, Houxing Ren, Mingjie Zhan, Hongsheng Li
Title: MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning
Abstract:
Natural language image-caption datasets, widely used for training Large Multimodal Models, mainly focus on natural scenarios and overlook the intricate details of mathematical figures that are critical for problem-solving, hindering the advancement of current LMMs in multimodal mathematical reasoning. To this end, we propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures, establishing a precise connection between the two modalities. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach, resulting in FigCodifier, an image-to-code model, and ImgCode-8.6M, the largest image-code dataset to date. Furthermore, we utilize FigCodifier to synthesize novel mathematical figures and then construct MM-MathInstruct-3M, a high-quality multimodal math instruction fine-tuning dataset. Finally, we present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving. Our model achieves a new open-source SOTA across all six metrics. Notably, it surpasses GPT-4o and Claude 3.5 Sonnet on the geometry problem-solving subset of MathVista, achieving improvements of 8.9% and 9.2%. The dataset and models will be released at https://github.com/mathllm/MathCoder.
中文: 当前多模态模型因缺乏对数学图形的精细理解而在推理上受限,为此我们利用代码监督和大规模数据集开发了MathCoder-VL模型,在几何问题求解上超越了GPT-4o等模型,实现了开源模型的最优性能。
English: Current multimodal models struggle with mathematical reasoning due to a lack of detailed figure understanding, so we developed MathCoder-VL using code supervision and a large-scale dataset to achieve state-of-the-art performance, surpassing models like GPT-4o in geometry problem-solving.

Authors:Zhiyuan Hu, Yibo Wang, Hanze Dong, Yuhui Xu, Amrita Saha, Caiming Xiong, Bryan Hooi, Junnan Li
Title: Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
Abstract:
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning. Prior work has shown that outcome-based reinforcement learning (RL) can incidentally elicit advanced reasoning behaviors such as self-correction, backtracking, and verification, phenomena often referred to as the model's "aha moment". However, the timing and consistency of these emergent behaviors remain unpredictable and uncontrollable, limiting the scalability and reliability of LRMs' reasoning capabilities. To address these limitations, we move beyond reliance on prompts and coincidental "aha moments". Instead, we explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks. Our three-stage pipeline of individual alignment, parameter-space merging, and domain-specific reinforcement learning boosts performance by over 10\% relative to instruction-tuned baselines. Furthermore, domain-specific RL from the aligned checkpoint yields an additional gain in the performance ceiling for both 7B and 32B models across math, coding, and science benchmarks, demonstrating that explicit meta-ability alignment offers a scalable and dependable foundation for reasoning. Code is available at: https://github.com/zhiyuanhubj/Meta-Ability-Alignment
Chinese: 该研究提出了一种通过三阶段流程将大型推理模型与演绎、归纳和溯因能力明确对齐的方法,使性能提升超过10%,并为跨数学、编程和科学领域的推理提供了可扩展且可靠的基础。
English: The study introduces a method to explicitly align large reasoning models with deduction, induction, and abduction through a three-stage pipeline, enhancing performance by over 10% and providing a scalable, reliable foundation for reasoning across various domains.
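
A minimal sketch of the parameter-space merging stage: the meta-ability-aligned checkpoints are combined by weighted parameter averaging before domain-specific RL. The uniform weights and tiny state dicts are illustrative assumptions.

import torch

def merge_checkpoints(state_dicts, weights):
    assert abs(sum(weights) - 1.0) < 1e-6
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# One aligned checkpoint per meta-ability (toy one-parameter "models").
deduction = {"w": torch.tensor([1.0, 0.0])}
induction = {"w": torch.tensor([0.0, 1.0])}
abduction = {"w": torch.tensor([1.0, 1.0])}
merged = merge_checkpoints([deduction, induction, abduction], [1/3, 1/3, 1/3])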

Authors:Anastasios Gerontopoulos, Spyros Gidaris, Nikos Komodakis
Title: Multi-Token Prediction Needs Registers
Abstract:
Multi-token prediction has emerged as a promising objective for improving language model pretraining, but its benefits have not consistently generalized to other settings such as fine-tuning. In this paper, we propose MuToR, a simple and effective approach to multi-token prediction that interleaves learnable register tokens into the input sequence, each tasked with predicting future targets. Compared to existing methods, MuToR offers several key advantages: it introduces only a negligible number of additional parameters, requires no architectural changes--ensuring compatibility with off-the-shelf pretrained language models--and remains aligned with the next-token pretraining objective, making it especially well-suited for supervised fine-tuning. Moreover, it naturally supports scalable prediction horizons. We demonstrate the effectiveness and versatility of MuToR across a range of use cases, including supervised fine-tuning, parameter-efficient fine-tuning (PEFT), and pretraining, on challenging generative tasks in both language and vision domains. Our code will be available at: https://github.com/nasosger/MuToR.
中文: MuToR是一种创新的多令牌预测方法,通过在输入序列中插入可学习的寄存器令牌来预测未来目标,具有参数增量极少、无需改变模型架构即可兼容现有预训练模型的特点,在语言和视觉任务的微调与预训练中均表现出优越性能。
English: MuToR is a novel multi-token prediction method that integrates learnable register tokens into input sequences to predict future targets, offering minimal parameter overhead, architectural compatibility with existing models, and enhanced performance across fine-tuning and pretraining scenarios in both language and vision tasks.

Authors:Jiajie Jin, Xiaoxi Li, Guanting Dong, Yuyao Zhang, Yutao Zhu, Yongkang Wu, Zhonghua Li, Qi Ye, Zhicheng Dou
Title: Hierarchical Document Refinement for Long-context Retrieval-augmented Generation
Abstract:
Real-world RAG applications often encounter long-context input scenarios, where redundant information and noise result in higher inference costs and reduced performance. To address these challenges, we propose LongRefiner, an efficient plug-and-play refiner that leverages the inherent structural characteristics of long documents. LongRefiner employs dual-level query analysis, hierarchical document structuring, and adaptive refinement through multi-task learning on a single foundation model. Experiments on seven QA datasets demonstrate that LongRefiner achieves competitive performance in various scenarios while incurring 10x lower computational cost and latency than the best baseline. Further analysis validates that LongRefiner is scalable, efficient, and effective, providing practical insights for real-world long-text RAG applications. Our code is available at https://github.com/ignorejjj/LongRefiner.
中文: LongRefiner作为高效的即插即用优化器,通过双级查询分析和自适应优化处理长文档中的冗余信息与噪声,在七个问答数据集上以十倍降低的计算成本实现了优异性能。
English: LongRefiner is an efficient plug-and-play refiner that tackles redundant information and noise in long-context RAG applications through dual-level query analysis and adaptive refinement, achieving competitive performance with 10x fewer computational costs across seven QA datasets.

Authors:Yile Wang, Zhanyu Shen, Hui Huang
Title: LDIR: Low-Dimensional Dense and Interpretable Text Embeddings with Relative Representations
Abstract:
Semantic text representation is a fundamental task in the field of natural language processing. Existing text embeddings (e.g., SimCSE and LLM2Vec) have demonstrated excellent performance, but the values of each dimension are difficult to trace and interpret. Bag-of-words, a classic sparse interpretable embedding, suffers from poor performance. Recently, Benara et al. (2024) proposed interpretable text embeddings using large language models, which form "0/1" embeddings based on responses to a series of questions. These interpretable text embeddings are typically high-dimensional (larger than 10,000). In this work, we propose Low-dimensional (lower than 500) Dense and Interpretable text embeddings with Relative representations (LDIR). The numerical values of its dimensions indicate semantic relatedness to different anchor texts selected through farthest point sampling, offering both semantic representation and a certain level of traceability and interpretability. We validate LDIR on multiple semantic textual similarity, retrieval, and clustering tasks. Extensive experimental results show that LDIR performs close to the black-box baseline models and outperforms the interpretable embedding baselines with much fewer dimensions. Code is available at https://github.com/szu-tera/LDIR.
Chinese: 本文提出LDIR,一种低维稠密且可解释的文本嵌入方法,在保持与黑盒模型相近性能的同时,以更少维度超越了现有可解释基线模型。
English: This paper introduces LDIR, a low-dimensional dense and interpretable text embedding method that achieves performance comparable to black-box models while outperforming existing interpretable baselines with significantly fewer dimensions.
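
A self-contained numpy sketch of the construction: choose anchor texts by farthest point sampling in a base embedding space, then represent any text by its cosine similarities to those anchors, so each dimension is traceable to one anchor text. The random encoder is a placeholder for a real model such as SimCSE.

import numpy as np

rng = np.random.default_rng(0)

def base_embed(texts):
    """Placeholder encoder; in practice a model such as SimCSE."""
    return rng.normal(size=(len(texts), 32))

def farthest_point_sampling(X, k):
    idx = [0]
    dists = np.linalg.norm(X - X[0], axis=1)
    for _ in range(k - 1):
        nxt = int(dists.argmax())
        idx.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return idx

corpus = [f"anchor candidate {i}" for i in range(100)]
C = base_embed(corpus)
anchors = C[farthest_point_sampling(C, k=16)]        # a 16-dim LDIR space

def ldir(texts):
    X = base_embed(texts)
    # Cosine similarity of each text to each anchor = one interpretable dim.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    An = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    return Xn @ An.T

print(ldir(["a sample sentence"]).shape)   # (1, 16)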

Authors:Haozhe Luo, Ziyu Zhou, Zixin Shu, Aurélie Pahud de Mortanges, Robert Berke, Mauricio Reyes
Title: On the Interplay of Human-AI Alignment, Fairness, and Performance Trade-offs in Medical Imaging
Abstract:
Deep neural networks excel in medical imaging but remain prone to biases, leading to fairness gaps across demographic groups. We provide the first systematic exploration of Human-AI alignment and fairness in this domain. Our results show that incorporating human insights consistently reduces fairness gaps and enhances out-of-domain generalization, though excessive alignment can introduce performance trade-offs, emphasizing the need for calibrated strategies. These findings highlight Human-AI alignment as a promising approach for developing fair, robust, and generalizable medical AI systems, striking a balance between expert guidance and automated efficiency. Our code is available at https://github.com/Roypic/Aligner.
中文摘要:在医学影像中,人机对齐能持续缩小公平性差距并提升泛化能力,但过度对齐需采用校准策略以平衡专家指导与自动化效率。
English Summary: Human-AI alignment in medical imaging consistently reduces fairness gaps and improves generalization, though excessive alignment requires calibrated strategies to balance expert guidance with automated efficiency.

Authors:Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Tianjiao Li, Chua Jia Jim Deryl, Mak Lee Onn, Gee Wah Ng, Kezhi Mao
Title: Rethinking Prompt Optimizers: From Prompt Merits to Optimization
Abstract:
Prompt optimization (PO) provides a practical way to improve response quality when users lack the time or expertise to manually craft effective prompts. Existing methods typically rely on LLMs' self-generation ability to optimize prompts. However, due to limited downward compatibility, the instruction-heavy prompts generated by advanced LLMs can overwhelm lightweight inference models and degrade response quality, while also lacking interpretability due to implicit optimization. In this work, we rethink prompt optimization through the lens of explicit and interpretable design. We first identify a set of model-agnostic prompt quality merits and empirically validate their effectiveness in enhancing prompt and response quality. We then introduce MePO, a merit-guided, locally deployable prompt optimizer trained on our merit-guided prompt preference dataset generated by a lightweight LLM. MePO avoids online optimization, reduces privacy concerns, and, by learning clear, interpretable merits, generalizes effectively to both large-scale and lightweight inference models. Experiments demonstrate that MePO achieves better results across diverse tasks and model types, offering a scalable and robust solution for real-world deployment. The code, model, and dataset can be found at https://github.com/MidiyaZhu/MePO
中文摘要:MePO提出了一种基于明确质量标准的提示优化方法,通过可解释的优化准则提升不同模型的响应质量,无需在线处理即可确保兼容性和高效性。
English Summary: MePO introduces a merit-guided prompt optimization approach that enhances response quality across various models by using explicit, interpretable criteria, ensuring compatibility and effectiveness without online processing.

Authors:Yidan Wang, Yubing Ren, Yanan Cao, Binxing Fang
Title: From Trade-off to Synergy: A Versatile Symbiotic Watermarking Framework for Large Language Models
Abstract:
The rise of Large Language Models (LLMs) has heightened concerns about the misuse of AI-generated text, making watermarking a promising solution. Mainstream watermarking schemes for LLMs fall into two categories: logits-based and sampling-based. However, current schemes entail trade-offs among robustness, text quality, and security. To mitigate this, we integrate logits-based and sampling-based schemes, harnessing their respective strengths to achieve synergy. In this paper, we propose a versatile symbiotic watermarking framework with three strategies: serial, parallel, and hybrid. The hybrid framework adaptively embeds watermarks using token entropy and semantic entropy, optimizing the balance between detectability, robustness, text quality, and security. Furthermore, we validate our approach through comprehensive experiments on various datasets and models. Experimental results indicate that our method outperforms existing baselines and achieves state-of-the-art (SOTA) performance. We believe this framework provides novel insights into diverse watermarking paradigms. Our code is available at https://github.com/redwyd/SymMark.
中文摘要:本文提出了一种多功能共生水印框架,通过结合基于logits和基于采样的方案,在LLM生成文本的检测性、鲁棒性、文本质量和安全性之间实现优化平衡,在多个数据集和模型上取得了最先进的性能表现。
English Summary: This paper introduces a versatile symbiotic watermarking framework that combines logits-based and sampling-based approaches to optimize the balance between detectability, robustness, text quality, and security in LLM-generated text, achieving state-of-the-art performance across multiple datasets and models.
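
A schematic sketch of the hybrid strategy at one decoding step: token entropy (the paper also uses semantic entropy) decides whether the watermark is embedded by biasing logits toward a green list or by a keyed pseudo-random sampling decision. The threshold, bias strength, and seeded generator are illustrative stand-ins.

import torch

def step_with_hybrid_watermark(logits, green_mask, delta=2.0, tau=3.0):
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    if entropy > tau:
        # High entropy: logits-based watermark (bias a green list), since
        # many tokens are plausible and quality barely degrades.
        logits = logits + delta * green_mask
        return int(torch.softmax(logits, -1).multinomial(1))
    # Low entropy: sampling-based watermark; keep logits intact and hide the
    # signal in the pseudo-random sampling decision instead.
    seeded = torch.Generator().manual_seed(42)   # stand-in for a keyed PRNG
    return int(probs.multinomial(1, generator=seeded))

vocab = 10
logits = torch.randn(vocab)
green = (torch.arange(vocab) % 2 == 0).float()   # toy green list
print(step_with_hybrid_watermark(logits, green))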

Authors:Yidan Wang, Yanan Cao, Yubing Ren, Fang Fang, Zheng Lin, Binxing Fang
Title: PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Abstract:
Large Language Models (LLMs) excel in various domains but pose inherent privacy risks. Existing methods to evaluate privacy leakage in LLMs often use memorized prefixes or simple instructions to extract data, both of which well-aligned models can easily block. Meanwhile, jailbreak attacks bypass LLM safety mechanisms to generate harmful content, but their role in privacy scenarios remains underexplored. In this paper, we examine the effectiveness of jailbreak attacks in extracting sensitive information, bridging privacy leakage and jailbreak attacks in LLMs. Moreover, we propose PIG, a novel framework targeting Personally Identifiable Information (PII) and addressing the limitations of current jailbreak methods. Specifically, PIG identifies PII entities and their types in privacy queries, uses in-context learning to build a privacy context, and iteratively updates it with three gradient-based strategies to elicit target PII. We evaluate PIG and existing jailbreak methods using two privacy-related datasets. Experiments on four white-box and two black-box LLMs show that PIG outperforms baseline methods and achieves state-of-the-art (SoTA) results. The results underscore significant privacy risks in LLMs, emphasizing the need for stronger safeguards. Our code is available at https://github.com/redwyd/PrivacyJailbreak.
中文摘要:本文提出PIG框架,通过越狱攻击从大语言模型中提取个人身份信息,其效果优于现有方法,揭示了大语言模型存在的严重隐私风险。
English Summary: This paper introduces PIG, a novel framework that leverages jailbreak attacks to extract Personally Identifiable Information from Large Language Models, demonstrating superior effectiveness over existing methods and highlighting significant privacy vulnerabilities.

Authors:Sajib Biswas, Mao Nishino, Samuel Jacob Chacko, Xiuwen Liu
Title: Adversarial Attack on Large Language Models using Exponentiated Gradient Descent
Abstract:
As Large Language Models (LLMs) are widely used, understanding them systematically is key to improving their safety and realizing their full potential. Although many models are aligned using techniques such as reinforcement learning from human feedback (RLHF), they are still vulnerable to jailbreaking attacks. Some of the existing adversarial attack methods search for discrete tokens that may jailbreak a target model while others try to optimize the continuous space represented by the tokens of the model's vocabulary. While techniques based on the discrete space may prove to be inefficient, optimization of continuous token embeddings requires projections to produce discrete tokens, which might render them ineffective. To fully utilize the constraints and the structures of the space, we develop an intrinsic optimization technique using exponentiated gradient descent with the Bregman projection method to ensure that the optimized one-hot encoding always stays within the probability simplex. We prove the convergence of the technique and implement an efficient algorithm that is effective in jailbreaking several widely used LLMs. We demonstrate the efficacy of the proposed technique using five open-source LLMs on four openly available datasets. The results show that the technique achieves a higher success rate with great efficiency compared to three other state-of-the-art jailbreaking techniques. The source code for our implementation is available at: https://github.com/sbamit/Exponentiated-Gradient-Descent-LLM-Attack
中文: 本文提出了一种针对大型语言模型的高效越狱技术,采用指数梯度下降与Bregman投影方法,相比现有技术具有更高的成功率和更强的效率优势。
English: This paper introduces an efficient jailbreaking technique for Large Language Models using exponentiated gradient descent with Bregman projection, which demonstrates higher success rates and greater efficiency compared to existing methods.
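For intuition, the exponentiated gradient update on the probability simplex (the standard mirror-descent step underlying such intrinsic-space methods; the paper's exact objective and schedule are not reproduced here) is

$$x_{t+1,i} = \frac{x_{t,i}\,\exp\left(-\eta\,[\nabla f(x_t)]_i\right)}{\sum_{j} x_{t,j}\,\exp\left(-\eta\,[\nabla f(x_t)]_j\right)},$$

which coincides with the Bregman projection under the KL divergence, so every iterate remains a valid distribution over the vocabulary without an explicit projection step.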

Authors:Long Chen, Xiaotian Song, Yanan Sun
Title: LAS: Loss-less ANN-SNN Conversion for Fully Spike-Driven Large Language Models
Abstract:
Spiking Large Language Models (LLMs) have emerged as an energy-efficient alternative to conventional LLMs through their event-driven computation. To effectively obtain spiking LLMs, researchers have developed various ANN-to-SNN conversion methods that leverage pre-trained ANN parameters while inheriting the energy efficiency of SNNs. However, existing conversion methods struggle with the extreme activation outliers and incompatible nonlinear operations of ANN-based LLMs. To address this, we propose a loss-less ANN-SNN conversion for fully spike-driven LLMs, termed LAS. Specifically, LAS introduces two novel neurons to convert the activation outliers and nonlinear operations of ANN-based LLMs. Moreover, LAS tailors spike-equivalent Transformer components for spiking LLMs, which ensures full spiking conversion without any loss of performance. Experimental results on six language models and two vision-language models demonstrate that LAS achieves loss-less conversion. Notably, on OPT-66B, LAS even improves accuracy by 2\% on the WSC task. In addition, parameter and ablation studies further verify the effectiveness of LAS. The source code is available at https://github.com/lc783/LAS
Chinese: LAS通过解决激活异常值和非线性操作问题,实现了尖峰大语言模型的无损转换,在保持全脉冲驱动的同时不损失性能。
English: LAS introduces a loss-less conversion method for spiking large language models by addressing activation outliers and nonlinear operations, achieving full spiking performance without accuracy compromise.

Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Xuanzhao Dong, Hao Wang, Haiyu Wu, Huayu Li, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Title: DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models
Abstract:
Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose $\textit{Diversity-aware Reward Adjustment}$ (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR.~GRPO, resulting in $\textit{DRA-GRPO}$ and $\textit{DGA-DR.~GRPO}$. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at https://github.com/xiwenc1/DRA-GRPO.
中文: 本文提出多样性感知奖励调整(DRA)方法,通过子模互信息将语义多样性融入奖励计算,在数学推理基准测试中以极低资源实现了最优性能。
English: This paper introduces Diversity-aware Reward Adjustment (DRA), a method that enhances reinforcement learning for language models by incorporating semantic diversity into rewards using Submodular Mutual Information, leading to state-of-the-art performance on mathematical reasoning benchmarks with minimal resources.
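As a rough illustration of diversity-aware reward adjustment, the sketch below downweights rewards of completions that are semantically close to the rest of their group; it uses mean cosine similarity as a crude stand-in for the paper's Submodular Mutual Information, and `alpha` is a hypothetical mixing coefficient.

```python
import numpy as np

def diversity_adjusted_rewards(rewards, embeddings, alpha=0.5):
    """Downweight rewards of semantically redundant completions.

    `embeddings`: (n, d) array of completion embeddings for one GRPO group;
    mean cosine similarity approximates the SMI-based redundancy score.
    """
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = E @ E.T                       # pairwise cosine similarities
    np.fill_diagonal(sim, 0.0)
    redundancy = sim.mean(axis=1)       # high => similar to the other samples
    weights = 1.0 - alpha * redundancy  # diverse completions keep more reward
    return np.asarray(rewards) * weights
```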

Authors:Derian Boer, Stephen Roth, Stefan Kramer
Title: Focus, Merge, Rank: Improved Question Answering Based on Semi-structured Knowledge Bases
Abstract:
In many real-world settings, machine learning models and interactive systems have access to both structured knowledge, e.g., knowledge graphs or tables, and unstructured content, e.g., natural language documents. However, most systems rely on only one of the two. Semi-Structured Knowledge Bases (SKBs) bridge this gap by linking unstructured content to nodes within structured data, thereby enabling new strategies for knowledge access and use. In this work, we present FocusedRetriever, a modular SKB-based framework for multi-hop question answering. It integrates components (VSS-based entity search, LLM-based generation of Cypher queries and pairwise re-ranking) in a way that enables it to outperform state-of-the-art methods across all three STaRK benchmark test sets, covering diverse domains and multiple performance metrics. The average first-hit rate exceeds that of the second-best method by 25.7%. FocusedRetriever leverages (1) the capacity of Large Language Models (LLMs) to extract relational facts and entity attributes from unstructured text, (2) node set joins to filter answer candidates based on these extracted triplets and constraints, (3) vector similarity search to retrieve and rank relevant unstructured content, and (4) the contextual capabilities of LLMs to finally rank the top-k answers. For generality, we only incorporate base LLMs in FocusedRetriever in our evaluation. However, our analysis of intermediate results highlights several opportunities for further upgrades including finetuning. The source code is publicly available at https://github.com/kramerlab/FocusedRetriever.
中文: FocusedRetriever是一个基于半结构化知识库的模块化框架,通过整合实体搜索、查询生成和重排序组件,在多领域多跳问答任务中全面超越了现有最优方法。
English: FocusedRetriever is a modular framework using Semi-Structured Knowledge Bases that integrates entity search, query generation, and re-ranking to outperform state-of-the-art methods in multi-hop question answering across diverse domains.

Authors:Yangyi Chen, Hao Peng, Tong Zhang, Heng Ji
Title: Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Abstract:
In standard large vision-language model (LVLM) pre-training, the model typically maximizes the joint probability of the caption conditioned on the image via next-token prediction (NTP); however, since only a small subset of caption tokens directly relates to the visual content, this naive NTP unintentionally fits the model to noise and increases the risk of hallucination. We present PRIOR, a simple vision-language pre-training approach that addresses this issue by prioritizing image-related tokens through differential weighting in the NTP loss, drawing from the importance sampling framework. PRIOR introduces a reference model, a text-only large language model (LLM) trained on the captions without image inputs, to weight each token based on its probability during LVLM training. Intuitively, tokens that are directly related to the visual inputs are harder to predict without the image and thus receive lower probabilities from the text-only reference LLM. During training, we implement a token-specific re-weighting term based on the importance scores to adjust each token's loss. We implement PRIOR in two distinct settings: LVLMs with visual encoders and LVLMs without visual encoders. We observe 19% and 8% average relative improvement, respectively, on several vision-language benchmarks compared to NTP. In addition, PRIOR exhibits superior scaling properties, as demonstrated by significantly higher scaling coefficients, indicating greater potential for performance gains compared to NTP given increasing compute and data.
Chinese: PRIOR是一种新颖的视觉语言预训练方法,通过基于纯文本参考模型的重要性分数对图像相关标记进行差异化损失加权,有效减少大型视觉语言模型的幻觉现象,相比标准的下一个标记预测方法实现了显著的性能提升。
English: PRIOR is a novel vision-language pre-training method that reduces hallucination in large vision-language models by prioritizing image-related tokens through differential loss weighting based on importance scores from a text-only reference model, achieving significant performance improvements over standard next-token prediction.
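A minimal sketch of the re-weighting idea, assuming shapes `(batch, seq, vocab)` for both models' logits; the exact weighting function derived from the importance-sampling view may differ from this `1 - p_ref` stand-in.

```python
import torch.nn.functional as F

def prior_weighted_ntp_loss(lvlm_logits, ref_logits, target_ids):
    """Tokens the text-only reference LM finds hard to predict (low
    probability without the image) are treated as image-related and
    receive higher weight in the LVLM's next-token-prediction loss."""
    ref_prob = F.softmax(ref_logits, dim=-1).gather(
        -1, target_ids.unsqueeze(-1)).squeeze(-1)          # p_ref(token)
    weights = 1.0 - ref_prob                               # image-related => large
    weights = weights / weights.mean()                     # keep the loss scale
    token_loss = F.cross_entropy(
        lvlm_logits.transpose(1, 2), target_ids, reduction="none")
    return (weights.detach() * token_loss).mean()
```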

Authors:Nahid Alam, Karthik Reddy Kanjula, Surya Guthikonda, Timothy Chung, Bala Krishna S Vegesna, Abhipsha Das, Anthony Susevski, Ryan Sze-Yin Chan, S M Iftekhar Uddin, Shayekh Bin Islam, Roshan Santhosh, Snegha A, Drishti Sharma, Chen Liu, Isha Chaturvedi, Genta Indra Winata, Ashvanth. S, Snehanshu Mukherjee, Alham Fikri Aji
Title: Behind Maya: Building a Multilingual Vision Language Model
Abstract:
In recent times, we have seen rapid development of large Vision-Language Models (VLMs). They have shown impressive results on academic benchmarks, primarily in widely spoken languages, but underperform in low-resource languages and varied cultural contexts. To address these limitations, we introduce Maya, an open-source Multilingual VLM. Our contributions are: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; and 2) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
中文:近期大型视觉语言模型在主流语言上表现优异,但在低资源语言和文化多样性方面存在不足,为此我们推出了开源多语言模型Maya,它通过支持八种语言的多语言数据集和模型,显著提升了跨文化视觉语言任务的理解能力。
English: Recent advances in large Vision-Language Models have excelled in major languages but struggle with low-resource languages and cultural diversity, prompting the introduction of Maya, an open-source multilingual VLM that enhances cross-cultural understanding through a multilingual dataset and model supporting eight languages.

Authors:Michael Majurski, Cynthia Matuszek
Title: Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora
Abstract:
Language Models (LMs) continue to advance, improving response quality and coherence. Given Internet-scale training datasets, LMs have likely encountered much of what users may ask them to generate in some form during their training. A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities. However, the human effort required for benchmark construction is rapidly being outpaced by the size and scope of the models under evaluation. Having humans build a benchmark for every possible domain of interest is impractical. Therefore, we propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations. This work leverages the same LMs to evaluate domain-specific knowledge automatically, using only grounding documents (e.g., a textbook) as input. This synthetic data benchmarking approach corresponds well with human curated questions producing a Spearman ranking correlation of 0.97 and a benchmark evaluation Pearson accuracy correlation of 0.75. This novel approach supports generating both multiple choice and open-ended synthetic data questions to gain diagnostic insight of LM capability. We apply this methodology to evaluate model performance on two recent arXiv preprints, discovering a surprisingly strong performance from Gemma-3 models on open-ended questions. Code is available at https://github.com/mmajurski/grounded-synth-lm-benchmark
中文:本文提出了一种利用语言模型和基础文档自动构建基于事实的合成基准测试方法,该方法与人工评估高度相关,并能跨领域对模型能力进行诊断性评估。
English: This paper introduces an automated method for creating fact-based synthetic benchmarks using language models and grounding documents, which correlates strongly with human-curated evaluations and enables diagnostic assessment of model capabilities across domains.

Authors:Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski, Yiming Yang, Ameet Talwalkar
Title: CodePDE: An Inference Framework for LLM-driven PDE Solver Generation
Abstract:
Partial differential equations (PDEs) are fundamental to modeling physical systems, yet solving them remains a complex challenge. Traditional numerical solvers rely on expert knowledge to implement and are computationally expensive, while neural-network-based solvers require large training datasets and often lack interpretability. In this work, we frame PDE solving as a code generation task and introduce CodePDE, the first inference framework for generating PDE solvers using large language models (LLMs). Leveraging advanced inference-time algorithms and scaling strategies, CodePDE unlocks critical capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and test-time scaling -- all without task-specific tuning. CodePDE achieves superhuman performance across a range of representative PDE problems. We also present a systematic empirical analysis of LLM-generated solvers, analyzing their accuracy, efficiency, and numerical scheme choices. Our findings highlight the promise and the current limitations of LLMs in PDE solving, offering a new perspective on solver design and opportunities for future model development. Our code is available at https://github.com/LithiumDA/CodePDE.
中文摘要:CodePDE首次提出通过大语言模型生成代码来求解偏微分方程的推理框架,无需特定任务调优即实现超越人类的表现,同时具备自主推理与调试能力。
English Summary: CodePDE introduces a novel framework that uses large language models to generate PDE solvers through code generation, achieving superior performance without task-specific training while enabling reasoning and self-improvement capabilities.
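The inference loop can be pictured as generate-run-refine; in this sketch `llm` and `run_solver` are hypothetical callables (prompt-to-code, and code-to-numerical-error plus feedback), not the repository's actual interfaces.

```python
def codepde_solve(pde_description, llm, run_solver, n_rounds=4):
    """Generate a solver, execute it, and feed the observed error back
    to the model for self-refinement, keeping the best attempt."""
    prompt = f"Write a numerical solver for: {pde_description}"
    best_code, best_err = None, float("inf")
    for _ in range(n_rounds):
        code = llm(prompt)
        err, feedback = run_solver(code)
        if err < best_err:
            best_code, best_err = code, err
        prompt = (f"{prompt}\n\nPrevious attempt:\n{code}\n"
                  f"Observed error: {err}. Feedback: {feedback}. Improve it.")
    return best_code, best_err
```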

Authors:Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal
Title: HealthBench: Evaluating Large Language Models Towards Improved Human Health
Abstract:
We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
中文: HealthBench是一个开源基准测试,通过多轮对话评估大型语言模型在医疗领域的性能与安全性,采用医生制定的多样化评分标准,覆盖多种健康场景和行为维度,展现了模型性能的持续提升。
English: HealthBench is an open-source benchmark for evaluating the performance and safety of large language models in healthcare through multi-turn conversations, using physician-created rubrics across diverse health contexts and behavioral dimensions, showing steady model improvements over time.

Authors:Donghoon Kim, Minji Bae, Kyuhong Shim, Byonghyo Shim
Title: Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Abstract:
Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are often ineffective due to limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.
中文摘要:提出的视觉引导解码(VGD)方法利用大语言模型和CLIP指导,为文生图模型生成连贯、可读的提示文本,无需额外训练即可在可解释性和上下文相关性上超越现有技术。
English Summary: The proposed Visually Guided Decoding (VGD) method uses large language models and CLIP guidance to generate coherent, human-readable prompts for text-to-image models, outperforming existing techniques in interpretability and contextual relevance without requiring additional training.
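One way to picture the gradient-free decoding step: fuse the LLM's next-token log-probabilities with CLIP similarities of the candidate continuations. The additive fusion rule and `lam` below are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def vgd_step(lm_logprobs, clip_scores, lam=1.0, top_k=50):
    """lm_logprobs: (V,) next-token log-probs from the LLM;
    clip_scores: (V,) image-text similarity of the prompt extended with
    each candidate token (in practice computed only for the top-k)."""
    topk = torch.topk(lm_logprobs, top_k)
    combined = topk.values + lam * clip_scores[topk.indices]
    return topk.indices[combined.argmax()]   # id of the chosen token
```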

Authors:Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song
Title: Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement
Abstract:
The advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. This progress presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. The reviewed literature systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. Diverse perspectives are integrated to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.
中文摘要:本文综述了新兴的LLM心理测量学领域,该领域运用心理测量工具与理论来应对大语言模型评估中的挑战,致力于建立以人为中心的人工智能系统并推动其发展。
English Summary: This review introduces LLM Psychometrics, an interdisciplinary field using psychometric principles to address the challenges of evaluating large language models beyond traditional benchmarks, aiming to advance human-centered AI systems.

Authors:Licheng Zhang, Bach Le, Naveed Akhtar, Siew-Kei Lam, Tuan Ngo
Title: Large Language Models for Computer-Aided Design: A Survey
Abstract:
Large Language Models (LLMs) have seen rapid advancements in recent years, with models like ChatGPT and DeepSeek, showcasing their remarkable capabilities across diverse domains. While substantial research has been conducted on LLMs in various fields, a comprehensive review focusing on their integration with Computer-Aided Design (CAD) remains notably absent. CAD is the industry standard for 3D modeling and plays a vital role in the design and development of products across different industries. As the complexity of modern designs increases, the potential for LLMs to enhance and streamline CAD workflows presents an exciting frontier. This article presents the first systematic survey exploring the intersection of LLMs and CAD. We begin by outlining the industrial significance of CAD, highlighting the need for AI-driven innovation. Next, we provide a detailed overview of the foundation of LLMs. We also examine both closed-source LLMs as well as publicly available models. The core of this review focuses on the various applications of LLMs in CAD, providing a taxonomy of six key areas where these models are making considerable impact. Finally, we propose several promising future directions for further advancements, which offer vast opportunities for innovation and are poised to shape the future of CAD technology. Github: https://github.com/lichengzhanguom/LLMs-CAD-Survey-Taxonomy
中文摘要:本文首次系统综述了大语言模型与计算机辅助设计的融合应用,梳理了六大关键应用领域并提出了未来研究方向。
English Summary: This article presents the first systematic survey exploring how Large Language Models can enhance Computer-Aided Design workflows, examining six key application areas and proposing future research directions.

Authors:Jiashen Du, Jesse Yao, Allen Liu, Zhekai Zhang
Title: Are LLMs complicated ethical dilemma analyzers?
Abstract:
One open question in the study of Large Language Models (LLMs) is whether they can emulate human ethical reasoning and act as believable proxies for human judgment. To investigate this, we introduce a benchmark dataset comprising 196 real-world ethical dilemmas and expert opinions, each segmented into five structured components: Introduction, Key Factors, Historical Theoretical Perspectives, Resolution Strategies, and Key Takeaways. We also collect non-expert human responses for comparison, limited to the Key Factors section due to their brevity. We evaluate multiple frontier LLMs (GPT-4o-mini, Claude-3.5-Sonnet, Deepseek-V3, Gemini-1.5-Flash) using a composite metric framework based on BLEU, Damerau-Levenshtein distance, TF-IDF cosine similarity, and Universal Sentence Encoder similarity. Metric weights are computed through an inversion-based ranking alignment and pairwise AHP analysis, enabling fine-grained comparison of model outputs to expert responses. Our results show that LLMs generally outperform non-expert humans in lexical and structural alignment, with GPT-4o-mini performing most consistently across all sections. However, all models struggle with historical grounding and proposing nuanced resolution strategies, which require contextual abstraction. Human responses, while less structured, occasionally achieve comparable semantic similarity, suggesting intuitive moral reasoning. These findings highlight both the strengths and current limitations of LLMs in ethical decision-making.
中文摘要:本研究通过比较大语言模型与专家及非专家对人类伦理困境的判断,评估其模拟人类伦理推理的能力,发现虽然模型在文本结构对齐方面表现优异,但在情境抽象和历史依据方面仍存在不足。
English Summary: This study evaluates whether large language models can replicate human ethical reasoning by comparing their structured responses to expert and non-expert human judgments across 196 ethical dilemmas, finding that while LLMs excel in lexical alignment they struggle with contextual abstraction and historical grounding.
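The composite metric reduces to a weighted sum of the four per-metric scores; the weights below are placeholders, whereas the paper derives them via inversion-based ranking alignment and pairwise AHP.

```python
def composite_score(metric_scores, weights):
    """Weighted fusion of BLEU, Damerau-Levenshtein similarity,
    TF-IDF cosine, and Universal Sentence Encoder similarity."""
    assert abs(sum(weights.values()) - 1.0) < 1e-6
    return sum(weights[m] * s for m, s in metric_scores.items())

# Hypothetical scores and weights for one model response:
print(composite_score(
    {"bleu": 0.21, "dl_sim": 0.64, "tfidf_cos": 0.58, "use_cos": 0.71},
    {"bleu": 0.15, "dl_sim": 0.20, "tfidf_cos": 0.30, "use_cos": 0.35},
))
```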

Authors:Yuyang Liu, Liuzhenghao Lv, Xiancheng Zhang, Li Yuan, Yonghong Tian
Title: BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
Abstract:
Biological protocols are fundamental to reproducibility and safety in life science research. While large language models (LLMs) perform well on general tasks, their systematic evaluation on these highly specialized, accuracy-critical, and inherently procedural texts remains limited. In this work, we present BioProBench, the first large-scale, multi-task benchmark for biological protocol understanding and reasoning. While there are several benchmark tasks involving protocol question answering, BioProBench provides a comprehensive suite of five core tasks: Protocol Question Answering, Step Ordering, Error Correction, Protocol Generation, and Protocol Reasoning, enabling a holistic evaluation of LLMs on procedural biological texts. Built upon 27K original protocols, it yields nearly 556K high-quality structured instances. We evaluate 12 mainstream open/closed-source LLMs. Experimental results reveal that some models perform well on basic understanding tasks (e.g., ~70% PQA-Acc., >64% ERR F1), but struggle significantly with deep reasoning and structured generation tasks like ordering and generation. Furthermore, model comparisons show diverse performance: certain open-source models approach closed-source levels on some tasks, yet bio-specific small models lag behind general LLMs, indicating limitations on complex procedural content. Overall, BioProBench, through its task design and experimental findings, systematically reveals the fundamental challenges for current LLMs in procedural knowledge understanding, deep adaptability to specific domains, reliability of structured reasoning, and handling of sophisticated precision and safety constraints, providing key directions for future AI in the field of scientific experiment automation. The code and data are available at: https://github.com/YuyangSunshine/bioprotocolbench and https://huggingface.co/datasets/BioProBench/BioProBench.
中文: BioProBench是首个针对生物实验规程的大规模评估基准,发现大语言模型在基础理解任务表现良好,但在深度推理和结构化生成方面存在明显不足。
English: BioProBench is the first comprehensive benchmark for evaluating large language models on biological protocols, revealing their strengths in basic understanding but significant struggles with deep reasoning and structured generation tasks.

Authors:LLM-Core Xiaomi, :, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, Kai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
Title: MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining
Abstract:
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
中文: MiMo-7B 是一款专为推理任务优化的语言模型,通过改进的预训练和强化学习方法,在数学、编程及通用推理任务上表现卓越,性能超越更大规模模型及 OpenAI o1-mini。
English: MiMo-7B is a reasoning-optimized language model that demonstrates exceptional performance across mathematics, coding, and general reasoning tasks, surpassing larger models and OpenAI o1-mini through enhanced pre-training and reinforcement learning techniques.
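A minimal sketch of a Multi-Token Prediction objective of the kind used here in pre-training: head k predicts the token k steps ahead. The head layout and uniform loss averaging are assumptions; MiMo's actual MTP module may differ.

```python
import torch.nn.functional as F

def multi_token_prediction_loss(logits_per_head, input_ids):
    """logits_per_head: list of (B, T, V) tensors; head k (1-indexed)
    is trained to predict input_ids shifted k positions ahead."""
    loss = 0.0
    for k, logits in enumerate(logits_per_head, start=1):
        targets = input_ids[:, k:]                  # tokens k steps ahead
        preds = logits[:, : targets.shape[1]]       # align sequence lengths
        loss = loss + F.cross_entropy(
            preds.transpose(1, 2), targets, reduction="mean")
    return loss / len(logits_per_head)
```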

Authors:Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, Xuanjing Huang
Title: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models
Abstract:
Instruction following evaluates large language models (LLMs) on their ability to generate outputs that adhere to user-defined constraints. However, existing benchmarks often rely on templated constraint prompts, which lack the diversity of real-world usage and limit fine-grained performance assessment. To fill this gap, we propose a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Building on this framework, we develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting, yielding 1,200 code-verifiable instruction-following test samples. We evaluate 19 LLMs across seven model families and uncover substantial variation in performance across constraint forms. For instance, average performance drops from 77.67% at Level I to 32.96% at Level IV. Furthermore, we demonstrate the utility of our approach by using it to generate data for reinforcement learning, achieving substantial gains in instruction following without degrading general performance. In-depth analysis indicates that these gains stem primarily from modifications in the model's attention modules parameters, which enhance constraint recognition and adherence. Code and data are available in https://github.com/Junjie-Ye/MulDimIF.
中文摘要:本研究提出了一个多维约束框架和自动化流程,用于生成多样化的代码可验证测试来评估大语言模型的指令遵循能力,揭示了不同约束形式下的显著性能差异,并证明了其在强化学习中的应用价值——通过注意力模块的参数调整有效提升了模型的约束识别与遵循能力。
English Summary: This study introduces a multi-dimensional constraint framework and automated pipeline to generate diverse, code-verifiable tests for evaluating large language models' instruction-following capabilities, revealing significant performance variations across constraint types and demonstrating its utility for reinforcement learning that enhances constraint adherence through attention module modifications.
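"Code-verifiable" means every constraint compiles to a deterministic check on the output; the four checks below are hypothetical examples of the pattern, not items from the released dataset.

```python
import re

def verify_constraints(response: str) -> dict:
    """Each user-facing constraint maps to one programmatic test."""
    return {
        "at_most_100_words": len(response.split()) <= 100,
        "mentions_keyword": "safety" in response.lower(),
        "ends_with_question": response.strip().endswith("?"),
        "contains_no_digits": re.search(r"\d", response) is None,
    }

checks = verify_constraints("Is safety the only thing that matters here?")
print(all(checks.values()), checks)
```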

Authors:Truc Mai-Thanh Nguyen, Dat Minh Nguyen, Son T. Luu, Kiet Van Nguyen
Title: ViMRHP: A Vietnamese Benchmark Dataset for Multimodal Review Helpfulness Prediction via Human-AI Collaborative Annotation
Abstract:
Multimodal Review Helpfulness Prediction (MRHP) is an essential task in recommender systems, particularly in E-commerce platforms. Determining the helpfulness of user-generated reviews enhances user experience and improves consumer decision-making. However, existing datasets focus predominantly on English and Indonesian, resulting in a lack of linguistic diversity, especially for low-resource languages such as Vietnamese. In this paper, we introduce ViMRHP (Vietnamese Multimodal Review Helpfulness Prediction), a large-scale benchmark dataset for the MRHP task in Vietnamese. This dataset covers four domains, including 2K products with 46K reviews. However, building such a large-scale dataset requires considerable time and cost. To optimize the annotation process, we leverage AI to assist annotators in constructing the ViMRHP dataset. With AI assistance, annotation time is reduced from 90-120 seconds per task to 20-40 seconds per task while maintaining data quality and lowering overall costs by approximately 65%. However, AI-generated annotations still have limitations in complex annotation tasks, which we further examine through a detailed performance analysis. In our experiment on ViMRHP, we evaluate baseline models on human-verified and AI-generated annotations to assess their quality differences. The ViMRHP dataset is publicly available at https://github.com/trng28/ViMRHP
Chinese: 本文介绍了ViMRHP这一大规模越南语多模态评论有用性预测数据集,通过AI辅助显著降低了标注时间和成本并保持质量,弥补了现有资源中语言多样性的不足。
English: This paper introduces ViMRHP, a large-scale Vietnamese multimodal review helpfulness prediction dataset that utilizes AI assistance to significantly reduce annotation time and costs while maintaining quality, addressing the lack of linguistic diversity in existing resources.

Authors:Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne
Title: On the Robustness of Reward Models for Language Model Alignment
Abstract:
The Bradley-Terry (BT) model is widely used in reward modeling for reinforcement learning from human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, accentuating the importance of distributional robustness of RMs in unseen data. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce zero-centered reward sum per batch, constraining the rewards with extreme magnitudes. We assess the impact of BSR in improving robustness in RMs through four scenarios of over-optimization, where BSR consistently manifests better robustness. Subsequently, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, surpassing state-of-the-art RMs at the 8B scale by more than 5% on complex preference prediction tasks. By conducting RLOO training with 8B RMs, we reduce generation length on AlpacaEval 2.0 by 40% while increasing win rate by 7%, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: https://github.com/LinkedIn-XFACT/RM-Robustness.
中文: Bradley-Terry模型在人类反馈强化学习中易因隐藏状态范数过度分散导致过优化,而提出的批处理零和正则化通过约束奖励极值增强了奖励模型的分布鲁棒性,显著提升了策略与黄金偏好模型的对齐效果。
English: The Bradley-Terry model in RLHF suffers from over-optimization due to excessive dispersion of hidden state norms, but the proposed batch-wise sum-to-zero regularization enhances reward model robustness and improves policy alignment in unseen data scenarios.
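A sketch of the training objective under the stated assumption that BSR penalizes the squared sum of all rewards in a batch; the paper's exact penalty form and coefficient are not reproduced here.

```python
import torch
import torch.nn.functional as F

def bt_loss_with_bsr(r_chosen, r_rejected, lam=0.01):
    """Bradley-Terry loss plus batch-wise sum-to-zero regularization:
    pushing the batch reward sum toward zero constrains rewards with
    extreme magnitudes and curbs over-optimization."""
    bt = -F.logsigmoid(r_chosen - r_rejected).mean()
    batch_sum = torch.cat([r_chosen, r_rejected]).sum()
    return bt + lam * batch_sum.pow(2)
```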

Authors:Jiashuo Sun, Xianrui Zhong, Sizhe Zhou, Jiawei Han
Title: DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) systems combine large language models (LLMs) with external knowledge retrieval, making them highly effective for knowledge-intensive tasks. A crucial but often under-explored component of these systems is the reranker. Since irrelevant documents in RAG systems can mislead the generator, the reranker plays a vital role in refining retrieved documents to enhance generation quality and explainability. However, it is challenging to determine the appropriate number of documents ($k$) that the reranker should select: too few may result in missing critical information, while too many introduce noise and inefficiencies. Although recent studies have explored LLM-based rerankers, they primarily leverage internal model knowledge and overlook the rich supervisory signals that LLMs can provide, such as using response quality as feedback for optimizing reranking decisions. In this paper, we propose DynamicRAG, a novel RAG framework where the reranker dynamically adjusts both the order and number of retrieved documents based on the query. We model the reranker as an agent optimized through reinforcement learning (RL), using rewards derived from LLM output quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates superior performance, achieving state-of-the-art results among models of same parameter sizes. The model, data and code are available at https://github.com/GasolSun36/DynamicRAG.
中文摘要:DynamicRAG提出了一种基于强化学习的动态重排器,通过大语言模型输出质量作为反馈信号,在检索增强生成系统中自适应调整文档选择顺序和数量,在多个知识密集型数据集上实现了最优性能。
English Summary: DynamicRAG introduces a reinforcement learning-optimized reranker that dynamically adjusts document selection in retrieval-augmented generation systems, achieving state-of-the-art performance across multiple datasets by using LLM output quality as feedback.

Authors:Yifan Wei, Xiaoyan Yu, Tengfei Pan, Angsheng Li, Li Du
Title: Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs
Abstract:
Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora, yet their performance remains suboptimal in knowledge-intensive domains such as medicine and scientific research, where high factual precision is required. While synthetic data provides a promising avenue for augmenting domain knowledge, existing methods frequently generate redundant samples that do not align with the model's true knowledge gaps. To overcome this limitation, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structure Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on LLaMA-3 and Qwen2 across multiple domain-specific benchmarks show that SENATOR effectively detects and repairs knowledge deficiencies, achieving notable performance improvements. The code and data for our methods and experiments are available at https://github.com/weiyifan1023/senator.
中文摘要:SENATOR框架通过结构熵和蒙特卡洛树搜索精准识别大语言模型的知识盲区,并生成针对性合成数据进行监督微调,有效提升了模型在专业领域的表现。
English Summary: The SENATOR framework uses structural entropy and Monte Carlo Tree Search to identify and fill knowledge gaps in large language models through targeted synthetic data generation, significantly improving their performance in specialized domains.

Authors:Zheng Yao, Shuai Wang, Guido Zuccon
Title: Pre-training vs. Fine-tuning: A Reproducibility Study on Dense Retrieval Knowledge Acquisition
Abstract:
Dense retrievers utilize pre-trained backbone language models (e.g., BERT, LLaMA) that are fine-tuned via contrastive learning to perform the task of encoding text into dense representations that can then be compared via a shallow similarity operation, e.g., the inner product. Recent research has questioned the role of fine-tuning vs. that of pre-training within dense retrievers, specifically arguing that retrieval knowledge is primarily gained during pre-training, meaning knowledge not acquired during pre-training cannot be subsequently acquired via fine-tuning. We revisit this idea here as the claim was only studied in the context of a BERT-based encoder using DPR as a representative dense retriever. We extend the previous analysis by testing other representation approaches (comparing the use of CLS tokens with that of mean pooling), backbone architectures (encoder-only BERT vs. decoder-only LLaMA), and additional datasets (MSMARCO in addition to Natural Questions). Our study confirms that in DPR tuning, pre-trained knowledge underpins retrieval performance, with fine-tuning primarily adjusting neuron activation rather than reorganizing knowledge. However, this pattern does not hold universally, such as in mean-pooled (Contriever) and decoder-based (LLaMA) models. We ensure full reproducibility and make our implementation publicly available at https://github.com/ielab/DenseRetriever-Knowledge-Acquisition.
Chinese: 密集检索器的性能主要依赖于预训练知识,微调仅调整神经元激活而非重组知识,但这一模式在如Contriever和LLaMA等模型中并不普遍适用。
English: Dense retrievers rely heavily on pre-trained knowledge for performance, with fine-tuning mainly adjusting neuron activations rather than reorganizing knowledge, though this pattern varies across models like Contriever and LLaMA.
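The two representation choices compared in the study differ in one line; a minimal sketch over a `(batch, seq, hidden)` tensor of final-layer states:

```python
def pool(hidden_states, attention_mask, mode="cls"):
    """DPR-style [CLS] pooling vs. Contriever-style mean pooling."""
    if mode == "cls":
        return hidden_states[:, 0]                       # [CLS] token vector
    mask = attention_mask.unsqueeze(-1).float()          # ignore padding
    return (hidden_states * mask).sum(1) / mask.sum(1)   # masked mean
```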

Authors:Lhuqita Fazry
Title: A Split-then-Join Approach to Abstractive Summarization for Very Long Documents in a Low Resource Setting
Abstract:
$\texttt{BIGBIRD-PEGASUS}$ achieves $\textit{state-of-the-art}$ results on abstractive text summarization for long documents. However, its capacity is still limited to a maximum of $4,096$ tokens, which causes performance degradation on summarization for very long documents. A common method to deal with this issue is to truncate the documents. In this research, we use a different approach: we fine-tune the pretrained $\texttt{BIGBIRD-PEGASUS}$ model on a dataset from another domain. First, we filter out all documents shorter than $20,000$ tokens to focus on very long documents. To prevent the domain-shift problem and overfitting during transfer learning on the small dataset, we augment the dataset by splitting each document-summary training pair into parts, so that each document fits into $4,096$ tokens. Source code is available at $\href{https://github.com/lhfazry/SPIN-summ}{https://github.com/lhfazry/SPIN-summ}$.
Chinese: BIGBIRD-PEGASUS模型在长文档摘要任务中表现卓越,但受限于4096个词元,本研究通过筛选超长文档并分割数据以适配模型长度,进行领域微调,有效解决了性能下降问题。
English: The BIGBIRD-PEGASUS model achieves state-of-the-art performance in abstractive text summarization for long documents but is limited to 4,096 tokens, so this research fine-tunes it on a domain-specific dataset augmented by splitting documents to handle very long texts effectively.
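A sketch of the augmentation under the assumption that each document part is paired with a proportional slice of the summary; the repository may align parts differently.

```python
def split_pair(doc_tokens, summary_tokens, max_len=4096):
    """Split one (document, summary) pair into sub-pairs whose document
    parts each fit the model's 4,096-token window."""
    parts = -(-len(doc_tokens) // max_len)        # ceil division
    doc_step = -(-len(doc_tokens) // parts)
    sum_step = -(-len(summary_tokens) // parts)
    return [
        (doc_tokens[i * doc_step:(i + 1) * doc_step],
         summary_tokens[i * sum_step:(i + 1) * sum_step])
        for i in range(parts)
    ]
```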

Authors:Zihan Guan, Mengxuan Hu, Ronghang Zhu, Sheng Li, Anil Vullikanti
Title: Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Abstract:
Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N, to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter.
中文摘要:研究发现,即使在良性数据集上微调大型语言模型也会显著增加其输出的危害性,而一种利用异常样本的新攻击方法严重破坏了多种模型的安全对齐,且现有防御措施大多无效。
English Summary: Fine-tuning large language models on even benign datasets can dangerously increase their harmfulness, and a new attack method using outlier samples severely compromises safety alignment across various models, with most existing defenses proving ineffective.

Authors:Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
Title: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Abstract:
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification-applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)-consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance, and we also release related $\href{https://github.com/qiuzh20/gated_attention}{codes}$ and $\href{https://huggingface.co/QwQZh/gated_attention}{models}$ to facilitate future research.
中文: 本研究表明,在缩放点积注意力后添加头部特定的Sigmoid门控,通过引入非线性和查询相关的稀疏门控机制,能持续提升模型性能、训练稳定性及长文本处理能力。
English: This study demonstrates that adding a head-specific sigmoid gate after Scaled Dot-Product Attention consistently enhances model performance, training stability, and long-context capabilities by introducing non-linearity and query-dependent sparse gating.
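The winning variant is a single elementwise modification after SDPA; in this sketch the gate is conditioned on the layer input `x` and projected per head dimension, both of which are assumptions about details the abstract leaves open.

```python
import torch
import torch.nn as nn

class GatedSDPA(nn.Module):
    """Sigmoid gate applied to the attention output before the output
    projection: adds non-linearity after the low-rank value mapping and
    query-dependent sparsity that mitigates attention sink."""
    def __init__(self, d_model, n_heads, head_dim):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, n_heads * head_dim)

    def forward(self, sdpa_out, x):
        # sdpa_out: (B, T, n_heads*head_dim); x: (B, T, d_model)
        gate = torch.sigmoid(self.gate_proj(x))
        return sdpa_out * gate
```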

Authors:Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang
Title: From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback
Abstract:
Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, which attempts to replicate human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings, rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce \textbf{Feedbacker}, an evaluation framework that provides comprehensive and fine-grained results, thereby enabling thorough identification of a model's specific strengths and weaknesses. Such feedback not only supports the targeted optimization of the model but also enhances the understanding of its behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC$^{2}$ (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences between several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the usage of Feedbacker and highlight its effectiveness and potential. Our project homepage and dataset are available at https://liudan193.github.io/Feedbacker.
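中文摘要:本文提出Feedbacker评估框架,主张评估重心从排行榜排名转向具有分析价值的反馈,通过可扩展的查询分类体系、自动查询合成与PC²逐点评估方法,提供细粒度结果以揭示模型的具体优势与不足。
English Summary: This paper introduces Feedbacker, an evaluation framework that shifts the focus from leaderboard rankings to analytically valuable feedback, combining a tree-based query taxonomy, automated query synthesis, and PC² pointwise evaluation to reveal models' fine-grained strengths and weaknesses.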

Authors:Dominik Koterwa, Maciej Świtała
Title: Enhancing BERTopic with Intermediate Layer Representations
Abstract:
BERTopic is a topic modeling algorithm that leverages transformer-based embeddings to create dense clusters, enabling the estimation of topic structures and the extraction of valuable insights from a corpus of documents, which allows users to process large-scale text data efficiently. While BERTopic is a powerful tool, embedding preparation can vary, including extracting representations from intermediate model layers and applying transformations to these embeddings. In this study, we evaluate 18 different embedding representations and present findings based on experiments conducted on three diverse datasets. To assess the algorithm's performance, we report topic coherence and topic diversity metrics across all experiments. Our results demonstrate that, for each dataset, it is possible to find an embedding configuration that performs better than the default setting of BERTopic. Additionally, we investigate the influence of stop words on different embedding configurations.
中文: 本研究评估了BERTopic在三个数据集上的18种嵌入表示,发现优化配置在主题连贯性和多样性上优于默认设置,同时探讨了停用词对不同嵌入配置的影响。
English: This study evaluates 18 embedding representations for BERTopic across three datasets, finding that optimized configurations outperform default settings in topic coherence and diversity while also examining stop words' impact.
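Extracting an intermediate-layer representation and handing it to BERTopic takes a few lines with `transformers`; the layer index and mean pooling here are illustrative choices among the kinds of configurations the study evaluates.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

def embed(texts, layer=-2):
    """Mean-pooled embeddings from an intermediate layer; pass the result
    to BERTopic via fit_transform(docs, embeddings=...)."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).hidden_states[layer]     # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()
```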

Authors:Woosang Lim, Zekun Li, Gyuwan Kim, Sungyoung Ji, HyeonJung Kim, Kyuri Choi, Jin Hyuk Lim, Kyungpyo Park, William Yang Wang
Title: MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG
Abstract:
Long-context large language models (LC LLMs) combined with retrieval-augmented generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained windows, and fragmented information from suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical RAG framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through real-time chunk- and document-level expansions. By initiating with finest-level retrieval and progressively incorporating broader, higher-level context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm MacRAG consistently surpasses baseline RAG pipelines in single- and multi-step generation using Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at https://github.com/Leezekun/MacRAG.
中文: MacRAG提出了一种分层检索增强生成框架,通过自适应融合从粗到细的文档粒度来优化精度和覆盖范围,在多种大语言模型的多跳推理任务中持续超越基线模型。
English: MacRAG introduces a hierarchical RAG framework that adaptively merges coarse-to-fine document granularities to optimize precision and coverage, consistently outperforming baseline models in multi-hop reasoning tasks across various LLMs.

Authors:Xinyue Lou, You Li, Jinan Xu, Xiangyu Shi, Chi Chen, Kaiyu Huang
Title: Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model
Abstract:
The rapid development of Multimodal Large Reasoning Models (MLRMs) has demonstrated broad application potential, yet their safety and reliability remain critical concerns that require systematic exploration. To address this gap, we conduct a comprehensive and systematic safety evaluation of 11 MLRMs across 5 benchmarks and unveil prevalent safety degradation in most advanced models. Moreover, our analysis reveals distinct safety patterns across different benchmarks: significant safety degradation is observed across jailbreak robustness benchmarks, whereas safety-awareness benchmarks demonstrate less pronounced degradation. In particular, a long thought process in some scenarios even enhances safety performance. Leveraging the model's intrinsic reasoning capabilities to detect unsafe intent is therefore a promising way to address safety issues in MLRMs. To operationalize this insight, we construct a multimodal tuning dataset that incorporates a safety-oriented thought process. Fine-tuning existing MLRMs with this dataset effectively enhances safety on both jailbreak robustness and safety-awareness benchmarks. This study provides a new perspective for developing safe MLRMs. Our dataset is available at https://github.com/xinyuelou/Think-in-Safety.
中文: 本研究系统评估了11种多模态大推理模型的安全性,发现普遍存在安全性能下降现象,并提出通过构建安全导向思维链数据集进行微调的新方法,有效利用模型内在推理能力提升其安全防护水平。
English: This study systematically evaluates the safety of 11 Multimodal Large Reasoning Models, revealing prevalent safety degradation and proposing a novel approach that enhances model safety by integrating safety-oriented reasoning processes through fine-tuning with a specially constructed dataset.

Authors:Vytenis Šliogeris, Povilas Daniušis, Artūras Nakvosas
Title: Full-Parameter Continual Pretraining of Gemma2: Insights into Fluency and Domain Knowledge
Abstract:
In this technical report, we empirically investigate the relationship between linguistic fluency and domain knowledge in the context of continual learning with large language models (LLMs). Specifically, we enhance the linguistic fluency of the Gemma2 LLM for the Lithuanian language by autoregressively pretraining its full parameter set on the first 10\% of the Lithuanian language component of the CulturaX dataset. To prevent catastrophic forgetting of the model's existing domain knowledge, we apply Elastic Weight Consolidation (EWC), leveraging Fisher information estimated using data from the Massive Multitask Language Understanding (MMLU) benchmark. In the post-training evaluations, we assess linguistic fluency through perplexity and evaluate domain knowledge using accuracy on a suite of language understanding benchmarks, including ARC-Easy, Belebele, GSM8K, HellaSwag, MMLU, TruthfulQA, and Winogrande, in both English and Lithuanian. The empirical results demonstrate that EWC not only mitigates catastrophic forgetting by preserving the model's performance in terms of both linguistic fluency and domain knowledge but also improves or maintains these capabilities for the newly added Lithuanian language. These findings highlight the potential for more efficient adaptation of general-purpose LLMs to under-represented languages without requiring access to the original training data. The accompanying codebase is openly accessible at https://github.com/Neurotechnology/LLM_EWC.
中文摘要:本研究表明,通过全参数预训练结合弹性权重巩固(EWC)方法,在提升Gemma2模型立陶宛语流畅度的同时有效防止了灾难性遗忘,实现了无需原始训练数据即可使通用大语言模型高效适应小语种的能力。
English Summary: This study demonstrates that Elastic Weight Consolidation (EWC) effectively prevents catastrophic forgetting while enhancing the Gemma2 model's Lithuanian language fluency through full-parameter pretraining, enabling efficient adaptation to underrepresented languages without original training data.
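The EWC term itself is compact; a sketch assuming a precomputed diagonal Fisher dictionary and a snapshot of the pre-adaptation parameters:

```python
import torch

def ewc_penalty(model, fisher, ref_params, lam=1.0):
    """Quadratically anchor each parameter to its pre-adaptation value,
    weighted by its estimated Fisher information (here, from MMLU data)."""
    loss = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - ref_params[name]).pow(2)).sum()
    return 0.5 * lam * loss
```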

Authors:Jinze Lv, Jian Chen, Zi Long, Xianghua Fu, Yin Chen
Title: TopicVD: A Topic-Based Dataset of Video-Guided Multimodal Machine Translation for Documentaries
Abstract:
Most existing multimodal machine translation (MMT) datasets are predominantly composed of static images or short video clips, lacking extensive video data across diverse domains and topics. As a result, they fail to meet the demands of real-world MMT tasks, such as documentary translation. In this study, we developed TopicVD, a topic-based dataset for video-supported multimodal machine translation of documentaries, aiming to advance research in this field. We collected video-subtitle pairs from documentaries and categorized them into eight topics, such as economy and nature, to facilitate research on domain adaptation in video-guided MMT. Additionally, we preserved their contextual information to support research on leveraging the global context of documentaries in video-guided MMT. To better capture the shared semantics between text and video, we propose an MMT model based on a cross-modal bidirectional attention module. Extensive experiments on the TopicVD dataset demonstrate that visual information consistently improves the performance of the NMT model in documentary translation. However, the MMT model's performance significantly declines in out-of-domain scenarios, highlighting the need for effective domain adaptation methods. Additionally, experiments demonstrate that global context can effectively improve translation performance. Dataset and our implementations are available at https://github.com/JinzeLv/TopicVD
中文: 现有数据集缺乏多样化视频内容,因此我们构建了TopicVD纪录片数据集,通过主题分类和跨模态注意力模型验证了视觉信息提升翻译效果,但跨领域适应性仍需改进。
English: Current multimodal machine translation datasets lack diverse video content, so we created TopicVD, a documentary-focused dataset with categorized topics and a cross-modal attention model, which shows visual data enhances translation but struggles with domain shifts.

Authors:Qianbo Zang, Christophe Zgrzendek, Igor Tchappi, Afshin Khadangi, Johannes Sedlmeir
Title: KG-HTC: Integrating Knowledge Graphs into LLMs for Effective Zero-shot Hierarchical Text Classification
Abstract:
Hierarchical Text Classification (HTC) involves assigning documents to labels organized within a taxonomy. Most previous research on HTC has focused on supervised methods. However, in real-world scenarios, employing supervised HTC can be challenging due to a lack of annotated data. Moreover, HTC often faces issues with large label spaces and long-tail distributions. In this work, we present Knowledge Graphs for zero-shot Hierarchical Text Classification (KG-HTC), which aims to address these challenges of HTC in applications by integrating knowledge graphs with Large Language Models (LLMs) to provide structured semantic context during classification. Our method retrieves relevant subgraphs from knowledge graphs related to the input text using a Retrieval-Augmented Generation (RAG) approach. Our KG-HTC can enhance LLMs to understand label semantics at various hierarchy levels. We evaluate KG-HTC on three open-source HTC datasets: WoS, DBpedia, and Amazon. Our experimental results show that KG-HTC significantly outperforms three baselines in the strict zero-shot setting, particularly achieving substantial improvements at deeper levels of the hierarchy. This evaluation demonstrates the effectiveness of incorporating structured knowledge into LLMs to address HTC's challenges in large label spaces and long-tailed label distributions. Our code is available at: https://github.com/QianboZang/KG-HTC.
中文: 本文提出KG-HTC方法,通过将知识图谱与大语言模型结合,为零样本分层文本分类提供结构化语义上下文,有效解决了大规模标签空间和长尾分布问题,实验表明其性能显著优于基线模型。
English: This paper introduces KG-HTC, a zero-shot hierarchical text classification method that integrates knowledge graphs with large language models to address challenges like large label spaces and long-tail distributions by providing structured semantic context, significantly outperforming baselines in experiments.
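The retrieve-then-classify loop described above is easy to picture in code. Below is a minimal sketch under stated assumptions: `llm` is any text-completion callable, `taxonomy` maps a node to its child labels, and the toy retriever and prompt are illustrative stand-ins for the paper's RAG setup, not its released implementation.

```python
# Hedged sketch of a KG-HTC-style zero-shot pipeline (names and prompts are
# illustrative assumptions, not the paper's exact implementation).

def retrieve_subgraph(text, kg_edges, top_k=5):
    """Toy retrieval: keep KG edges whose endpoints overlap the input text."""
    tokens = set(text.lower().split())
    scored = []
    for head, relation, tail in kg_edges:
        overlap = len(tokens & set(f"{head} {tail}".lower().split()))
        if overlap:
            scored.append((overlap, (head, relation, tail)))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [edge for _, edge in scored[:top_k]]

def classify_level(llm, text, candidate_labels, subgraph):
    """One hierarchy level: ask the LLM to pick a label given KG context."""
    context = "\n".join(f"{h} --{r}--> {t}" for h, r, t in subgraph)
    prompt = (
        f"Knowledge graph context:\n{context}\n\n"
        f"Document: {text}\n"
        f"Choose one label from {candidate_labels}. Answer with the label only."
    )
    return llm(prompt).strip()

def kg_htc(llm, text, taxonomy, kg_edges):
    """Walk the taxonomy top-down, narrowing candidates at each level."""
    path, node = [], "ROOT"
    while taxonomy.get(node):                     # children remain
        subgraph = retrieve_subgraph(text, kg_edges)
        node = classify_level(llm, text, taxonomy[node], subgraph)
        path.append(node)
    return path
```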

Authors:Hanxun Huang, Sarah Erfani, Yige Li, Xingjun Ma, James Bailey
Title: X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Abstract:
As Contrastive Language-Image Pre-training (CLIP) models are increasingly adopted for diverse downstream tasks and integrated into large vision-language models (VLMs), their susceptibility to adversarial perturbations has emerged as a critical concern. In this work, we introduce X-Transfer, a novel attack method that exposes a universal adversarial vulnerability in CLIP. X-Transfer generates a Universal Adversarial Perturbation (UAP) capable of deceiving various CLIP encoders and downstream VLMs across different samples, tasks, and domains. We refer to this property as super transferability--a single perturbation achieving cross-data, cross-domain, cross-model, and cross-task adversarial transferability simultaneously. This is achieved through surrogate scaling, a key innovation of our approach. Unlike existing methods that rely on fixed surrogate models, which are computationally intensive to scale, X-Transfer employs an efficient surrogate scaling strategy that dynamically selects a small subset of suitable surrogates from a large search space. Extensive evaluations demonstrate that X-Transfer significantly outperforms previous state-of-the-art UAP methods, establishing a new benchmark for adversarial transferability across CLIP models. The code is publicly available in our GitHub repository: https://github.com/HanxunH/XTransferBench
中文: 本文提出X-Transfer攻击方法,通过动态代理缩放生成具有超级迁移性的通用对抗扰动,能高效地跨任务、跨领域破坏CLIP模型。
English: The paper introduces X-Transfer, a novel attack method that generates a universal adversarial perturbation with super transferability, efficiently compromising CLIP models across various tasks and domains through dynamic surrogate scaling.
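A rough sketch of what surrogate scaling implies at the optimization level: each step samples a small subset of surrogate encoders from a pool and updates a single shared perturbation against all of them. This is an assumption-level, PGD-style illustration (pool contents, loss, and hyperparameters are placeholders), not the released X-Transfer code.

```python
# Minimal sketch of universal-perturbation training with dynamic surrogate
# sampling. Assumes each encoder maps [B, 3, 224, 224] images to embeddings.
import random
import torch
import torch.nn.functional as F

def train_uap(surrogate_pool, loader, eps=8/255, lr=1e-2, steps=1000, k=2):
    delta = torch.zeros(1, 3, 224, 224, requires_grad=True)   # the single UAP
    opt = torch.optim.Adam([delta], lr=lr)
    for _, (images, _) in zip(range(steps), loader):
        encoders = random.sample(surrogate_pool, k)   # dynamic surrogate subset
        loss = 0.0
        for enc in encoders:
            clean = enc(images).detach()
            adv = enc((images + delta).clamp(0, 1))
            # push adversarial embeddings away from the clean ones
            loss = loss + F.cosine_similarity(adv, clean, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)                   # keep the L_inf budget
    return delta.detach()
```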

Authors:Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
Title: Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Abstract:
Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities combine and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
中文摘要:本研究通过模型融合实现了大型语言模型推理能力向视觉语言模型的无需训练迁移,并揭示了感知功能主要分布于模型早期层,而融合后推理能力则扩展至所有层的工作机制。
English Summary: This study demonstrates that model merging effectively transfers reasoning capabilities from Large Language Models to Vision-Language Models without requiring training, while revealing that perception functions are concentrated in early layers and reasoning emerges across all layers after integration.
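The simplest instantiation of cross-modality merging is linear interpolation of the parameters the two models share. The sketch below assumes plain state dicts with matching names for the language backbone; the paper's analysis of which layers carry perception versus reasoning is richer than this uniform rule.

```python
# A minimal sketch of cross-modality merging by linear interpolation of the
# shared language-model weights (illustrative, not the paper's full recipe).
import torch

def merge_llm_into_vlm(vlm_state, llm_state, alpha=0.5):
    """Interpolate parameters that exist in both models; keep vision weights."""
    merged = {}
    for name, w_vlm in vlm_state.items():
        w_llm = llm_state.get(name)
        if w_llm is not None and w_llm.shape == w_vlm.shape:
            merged[name] = (1 - alpha) * w_vlm + alpha * w_llm
        else:
            merged[name] = w_vlm          # vision tower, projector, etc.
    return merged

# usage: vlm.load_state_dict(merge_llm_into_vlm(vlm.state_dict(), llm.state_dict()))
```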

Authors:Ran Zhang, Wei Zhao, Lieve Macken, Steffen Eger
Title: LiTransProQA: an LLM-based Literary Translation evaluation metric with Professional Question Answering
Abstract:
The impact of Large Language Models (LLMs) has extended into literary domains. However, existing evaluation metrics prioritize mechanical accuracy over artistic expression and tend to overrate machine translation as being superior to human translation from experienced professionals. In the long run, this bias could result in an irreversible decline in translation quality and cultural authenticity. In response to the urgent need for a specialized literary evaluation metric, we introduce LiTransProQA, a novel, reference-free, LLM-based question-answering framework designed for literary translation evaluation. LiTransProQA uniquely integrates insights from professional literary translators and researchers, focusing on critical elements in literary quality assessment such as literary devices, cultural understanding, and authorial voice. Our extensive evaluation shows that while literary-finetuned XCOMET-XL yields marginal gains, LiTransProQA substantially outperforms current metrics, achieving up to 0.07 gain in correlation and surpassing the best state-of-the-art metrics by over 15 points in adequacy assessments. Incorporating professional translator insights as weights further improves performance, highlighting the value of translator inputs. Notably, LiTransProQA reaches human-level evaluation performance comparable to trained student evaluators. It shows broad applicability to open-source models like LLaMa3.3-70b and Qwen2.5-32b, indicating its potential as an accessible and training-free tool for evaluating literary translations that require local processing due to copyright or ethical considerations. The code and datasets are available under: https://github.com/zhangr2021/TransProQA.
中文摘要:LiTransProQA是一种基于大语言模型的无参考评估框架,通过整合专业译者见解来评估文学翻译质量,其表现超越现有指标并达到人类评估水平。
English Summary: LiTransProQA is a novel, reference-free LLM-based framework that integrates professional translator insights to evaluate literary translations, outperforming existing metrics and achieving human-level assessment performance.

Authors:Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Title: TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Abstract:
Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
中文:TokLIP是一种创新的视觉分词器,通过将高层语义融入矢量量化标记并支持高效的端到端训练,显著提升了多模态理解能力,在理解和生成任务中均实现了卓越的数据效率和性能表现。
English: TokLIP is a novel visual tokenizer that enhances multimodal comprehension by integrating high-level semantics into vector-quantized tokens while enabling efficient end-to-end training, achieving superior data efficiency and performance in both understanding and generation tasks.

Authors:Yuhui Xu, Hanze Dong, Lei Wang, Doyen Sahoo, Junnan Li, Caiming Xiong
Title: Scalable Chain of Thoughts via Elastic Reasoning
Abstract:
Large reasoning models (LRMs) have achieved remarkable progress on complex tasks by generating extended chains of thought (CoT). However, their uncontrolled output lengths pose significant challenges for real-world deployment, where inference-time budgets on tokens, latency, or compute are strictly constrained. We propose Elastic Reasoning, a novel framework for scalable chain of thoughts that explicitly separates reasoning into two phases--thinking and solution--with independently allocated budgets. At test time, Elastic Reasoning prioritizes the completeness of solution segments, significantly improving reliability under tight resource constraints. To train models that are robust to truncated thinking, we introduce a lightweight budget-constrained rollout strategy, integrated into GRPO, which teaches the model to reason adaptively when the thinking process is cut short and generalizes effectively to unseen budget constraints without additional training. Empirical results on mathematical (AIME, MATH500) and programming (LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning performs robustly under strict budget constraints, while incurring significantly lower training cost than baseline methods. Remarkably, our approach also produces more concise and efficient reasoning even in unconstrained settings. Our code has been made available at https://github.com/SalesforceAIResearch/Elastic-Reasoning.
中文摘要:弹性推理是一种新颖框架,将推理过程划分为思维和解答两个独立预算阶段,使大型推理模型能够在严格计算限制下保持稳健性能,同时提高效率和简洁性。
English Summary: Elastic Reasoning is a novel framework that divides reasoning into thinking and solution phases with independent budgets, enabling large reasoning models to perform robustly under strict computational constraints while improving efficiency and conciseness.
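The budget separation can be sketched as two decoding calls: the thinking phase may be truncated at its cap, while the solution phase always receives its full allocation. `generate` is an assumed helper that returns only the continuation; the tags and budgets below are illustrative.

```python
# Sketch of budget-split decoding in the spirit of Elastic Reasoning.

def elastic_decode(generate, prompt, think_budget=512, solution_budget=256):
    # Phase 1: thinking, hard-capped; may end early at its natural close tag.
    thinking = generate(prompt + "<think>", max_tokens=think_budget,
                        stop="</think>")
    if not thinking.endswith("</think>"):
        thinking += "</think>"            # force-close a truncated thought
    # Phase 2: the solution gets its own untouched budget, so a final answer
    # is emitted even when thinking was cut short.
    solution = generate(prompt + "<think>" + thinking + "<solution>",
                        max_tokens=solution_budget, stop="</solution>")
    return thinking, solution
```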

Authors:Mengze Hong, Wailing Ng, Chen Jason Zhang, Di Jiang
Title: QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
Abstract:
The rapid advancement of Chinese LLMs underscores the need for vertical-domain evaluations to ensure reliable applications. However, existing benchmarks often lack domain coverage and provide limited insights into the Chinese working context. Leveraging qualification exams as a unified framework for expertise evaluation, we introduce QualBench, the first multi-domain Chinese QA benchmark dedicated to localized assessment of Chinese LLMs. The dataset includes over 17,000 questions across six vertical domains, drawn from 24 Chinese qualifications to align with national policies and professional standards. Results reveal an interesting pattern of Chinese LLMs consistently surpassing non-Chinese models, with the Qwen2.5 model outperforming the more advanced GPT-4o, emphasizing the value of localized domain knowledge in meeting qualification requirements. The average accuracy of 53.98% reveals the current gaps in domain coverage within model capabilities. Furthermore, we identify performance degradation caused by LLM crowdsourcing, assess data contamination, and illustrate the effectiveness of prompt engineering and model fine-tuning, suggesting opportunities for future improvements through multi-domain RAG and Federated Learning.
中文摘要:QualBench是首个基于中国资格考试的多领域中文问答基准,通过评估发现中文大模型在本地化知识方面优于非中文模型,并揭示了通过提示工程等技术提升性能的改进空间。
English Summary: QualBench is the first multi-domain Chinese QA benchmark using qualification exams to evaluate localized knowledge of Chinese LLMs, revealing their superiority over non-Chinese models and identifying key areas for improvement through techniques like prompt engineering.

Authors:Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng
Title: Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Abstract:
The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into a sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features significantly reduces the abilities of LLMs in only one language, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs. The code is publicly available at https://github.com/Aatrox103/multilingual-llm-features.
中文摘要:本研究提出了一种新指标来评估稀疏自编码器(SAE)特征的单语性,发现某些特征与特定语言密切相关,通过针对性消除或增强这些特征,可精确控制大语言模型的多语言生成能力。
English Summary: This study introduces a novel metric to evaluate the monolinguality of features from Sparse Autoencoders (SAEs), revealing that certain features are language-specific and their targeted ablation or enhancement can precisely control the multilingual output of Large Language Models.
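Two ingredients of this work lend themselves to a short sketch: a monolinguality score (the exact metric is the paper's contribution; the ratio below is only a plausible stand-in) and feature ablation through the SAE's encode/decode pair, whose interface is assumed.

```python
# Assumption-level sketch of a "monolinguality" score and SAE feature ablation.
import torch

def monolinguality(acts_by_lang, feature_idx):
    """acts_by_lang: dict lang -> tensor [n_tokens, n_features] of SAE acts."""
    means = {lang: acts[:, feature_idx].mean().item()
             for lang, acts in acts_by_lang.items()}
    best = max(means, key=means.get)
    runner_up = max(v for k, v in means.items() if k != best)
    return best, means[best] / (runner_up + 1e-8)   # >> 1 means language-specific

def ablate_feature(sae, hidden, feature_idx):
    """Remove one SAE feature's contribution from a hidden state."""
    z = sae.encode(hidden)                # sparse feature activations (assumed API)
    z[..., feature_idx] = 0.0             # the ablation
    return sae.decode(z)
```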

Authors:Jaehyun Jeon, Min Soo Kim, Jang Han Yoon, Sumin Shim, Yejin Choi, Hanbin Kim, Youngjae Yu
Title: Do MLLMs Capture How Interfaces Guide User Behavior? A Benchmark for Multimodal UI/UX Design Understanding
Abstract:
User interface (UI) design goes beyond visuals, guiding user behavior and overall user experience (UX). Strategically crafted interfaces, for example, can boost sign-ups and drive business sales, underscoring the shift toward UI/UX as a unified design concept. While recent studies have explored UI quality evaluation using Multimodal Large Language Models (MLLMs), they largely focus on surface-level features, overlooking behavior-oriented aspects. To fill this gap, we introduce WiserUI-Bench, a novel benchmark for assessing models' multimodal understanding of UI/UX design. It includes 300 diverse real-world UI image pairs, each consisting of two design variants A/B-tested at scale by actual companies, where one was empirically validated to steer more user actions than the other. Each pair is accompanied one or more of 684 expert-curated rationales that capture key factors behind each winning design's effectiveness, spanning diverse cognitive dimensions of UX. Our benchmark supports two core tasks: (1) selecting the more effective UI/UX design by predicting the A/B test verified winner and (2) assessing how well a model, given the winner, can explain its effectiveness in alignment with expert reasoning. Experiments across several MLLMs show that current models exhibit limited nuanced reasoning about UI/UX design and its behavioral impact. We believe our work will foster research in UI/UX understanding and enable broader applications such as behavior-aware interface optimization.
中文摘要:本文提出WiserUI-Bench这一新型基准,通过300组经A/B测试验证的真实界面设计对和专家解析,填补了当前界面设计评估中行为导向分析的空白,实验表明现有模型对界面用户体验设计的细微推理能力仍显不足。
English Summary: This paper introduces WiserUI-Bench, a novel benchmark addressing the gap in evaluating UI/UX design by focusing on behavior-oriented aspects through 300 real-world UI image pairs with expert rationales, revealing current models' limited nuanced reasoning about design effectiveness.

Authors:Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Jiang Zong, Hao Peng, Jianwei Yin
Title: Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization
Abstract:
Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to downstream tasks. Since the majority of knowledge is acquired during pre-training, attributing the predictions of fine-tuned LLMs to their pre-training data may provide valuable insights. Influence functions have been proposed as a means to explain model predictions based on training data. However, existing approaches fail to compute "multi-stage" influence and lack scalability to billion-scale LLMs. In this paper, we propose the multi-stage influence function to attribute the downstream predictions of fine-tuned LLMs to pre-training data under the full-parameter fine-tuning paradigm. To enhance the efficiency and practicality of our multi-stage influence function, we leverage Eigenvalue-corrected Kronecker-Factored (EK-FAC) parameterization for efficient approximation. Empirical results validate the superior scalability of EK-FAC approximation and the effectiveness of our multi-stage influence function. Additionally, case studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power, with exemplars illustrating insights provided by multi-stage influence estimates. Our code is public at https://github.com/colored-dye/multi_stage_influence_function.
中文:本文提出多阶段影响函数,用于在完整参数微调下将精调后大语言模型的预测溯源至预训练数据,并通过EK-FAC近似提升效率,实证结果验证了其优越可扩展性与案例解释力。
English: This paper introduces a multi-stage influence function to trace fine-tuned LLM predictions back to pre-training data, using EK-FAC approximation for scalability and demonstrating its effectiveness through empirical validation and case studies.
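Conceptually, an influence estimate is an inner product between a query gradient and a training-example gradient, preconditioned by an inverse curvature term. The sketch below swaps the paper's EK-FAC preconditioner for a damped identity, a strong simplification meant only to show the shape of the computation.

```python
# Identity-preconditioned proxy for an influence score: -g_q^T H^{-1} g_z,
# with H approximated as damping * I (EK-FAC is the paper's real choice).
import torch

def influence(model, loss_fn, z_train, z_query, damping=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    g_q = torch.autograd.grad(loss_fn(model, z_query), params)
    g_z = torch.autograd.grad(loss_fn(model, z_train), params)
    return -sum((gq * gz).sum() for gq, gz in zip(g_q, g_z)) / damping
```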

Authors:Lizhe Fang, Yifei Wang, Khashayar Gatmiry, Lei Fang, Yisen Wang
Title: Rethinking Invariance in In-context Learning
Abstract:
In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed Invariant ICL (InvICL), a methodology designed to achieve invariance in ICL while ensuring the two properties. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths. Code is available at https://github.com/PKU-ML/InvICL.
Chinese: 上下文学习(ICL)存在对示例顺序敏感的问题,而提出的不变上下文学习(InvICL)方法通过确保信息不泄露和上下文相互依赖,在多数基准数据集上超越了现有模型,展现出卓越的泛化能力。
English: In-Context Learning (ICL) faces sensitivity to example order, but the proposed Invariant ICL (InvICL) method overcomes this by ensuring information non-leakage and context interdependence, achieving superior performance and generalization across benchmarks.

Authors:Fangwei Zhu, Peiyi Wang, Zhifang Sui
Title: Chain-of-Thought Tokens are Computer Program Variables
Abstract:
Chain-of-thoughts (CoT) requires large language models (LLMs) to generate intermediate steps before reaching the final answer, and has been proven effective in helping LLMs solve complex reasoning tasks. However, the inner mechanism of CoT still remains largely unclear. In this paper, we empirically study the role of CoT tokens in LLMs on two compositional tasks: multi-digit multiplication and dynamic programming. While CoT is essential for solving these problems, we find that preserving only the tokens that store intermediate results achieves comparable performance. Furthermore, we observe that storing intermediate results in an alternative latent form does not affect model performance. We also randomly intervene on some values in the CoT, and notice that subsequent CoT tokens and the final answer change correspondingly. These findings suggest that CoT tokens may function like variables in computer programs, but with potential drawbacks like unintended shortcuts and computational complexity limits between tokens. The code and data are available at https://github.com/solitaryzero/CoTs_are_Variables.
中文:思维链(CoT)通过生成中间步骤帮助大语言模型解决复杂推理任务,本研究发现CoT标记类似程序中的变量,即使仅保留中间结果或以潜在形式存储,其性能仍可保持。
English: Chain-of-thoughts (CoT) enables large language models to solve complex reasoning tasks by generating intermediate steps, and this study reveals that CoT tokens function like variables in programs, with their performance maintained even when only intermediate results are preserved or altered in latent forms.
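The intervention experiment reduces to editing one intermediate value in the CoT prefix and letting the model continue from both versions. A minimal sketch, assuming a `generate` helper that returns the greedy continuation:

```python
# Sketch of the variable-intervention probe: flip one intermediate value and
# check whether downstream CoT tokens and the answer track the edit.

def intervene_on_cot(generate, prompt, cot_prefix, old_value, new_value):
    edited = cot_prefix.replace(old_value, new_value, 1)   # flip one "variable"
    original_rest = generate(prompt + cot_prefix)
    edited_rest = generate(prompt + edited)
    # If CoT tokens behave like program variables, edited_rest should
    # propagate new_value through later steps and the final answer.
    return original_rest, edited_rest
```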

Authors:Md Aminul Islam, Ahmed Sayeed Faruk
Title: Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations
Abstract:
Recommender systems are essential for delivering personalized content across digital platforms by modeling user preferences and behaviors. Recently, large language models (LLMs) have been adopted for prompt-based recommendation due to their ability to generate personalized outputs without task-specific training. However, LLM-based methods face limitations such as limited context window size, inefficient pointwise and pairwise prompting, and difficulty handling listwise ranking due to token constraints. LLMs can also be sensitive to position bias, as they may overemphasize earlier items in the prompt regardless of their true relevance. To address and investigate these issues, we propose a hybrid framework that combines a traditional recommendation model with an LLM for reranking top-k items using structured prompts. We evaluate the effects of user history reordering and instructional prompts for mitigating position bias. Experiments on MovieLens-100K show that randomizing user history improves ranking quality, but LLM-based reranking does not outperform the base model. Explicit instructions to reduce position bias are also ineffective. Our evaluations reveal limitations in LLMs' ability to model ranking context and mitigate bias. Our code is publicly available at https://github.com/aminul7506/LLMForReRanking.
Chinese: 本研究提出了一种结合传统推荐模型与大语言模型的混合重排序框架,发现尽管随机化用户历史可提升排序质量,但大语言模型在缓解位置偏差方面存在局限,且重排序效果未超越基础模型。
English: This study introduces a hybrid framework combining traditional recommendation models with large language models (LLMs) for reranking, revealing that LLMs struggle with mitigating position bias and fail to outperform base models in ranking tasks despite user history randomization.
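The hybrid pipeline is straightforward to sketch: a base recommender supplies top-k candidates, the LLM reranks them from a structured prompt, and shuffling the user history probes position bias. The prompt wording and parsing below are illustrative assumptions.

```python
# Hedged sketch of LLM reranking on top of a conventional recommender.
import random

def rerank_with_llm(llm, user_history, candidates, shuffle_history=True):
    history = list(user_history)
    if shuffle_history:
        random.shuffle(history)           # probe position bias in the prompt
    prompt = (
        "User has watched: " + ", ".join(history) + "\n"
        "Rerank these movies from most to least relevant, one per line:\n"
        + "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    )
    lines = [l.strip() for l in llm(prompt).splitlines() if l.strip()]
    ranked = [c for l in lines for c in candidates if c in l]
    seen, out = set(), []
    for c in ranked:                      # de-duplicate, keep the LLM's order
        if c not in seen:
            seen.add(c)
            out.append(c)
    return out or candidates              # fall back to the base ranking
```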

Authors:Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Title: Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Abstract:
Reasoning lies at the heart of intelligence, shaping the ability to make decisions, draw conclusions, and generalize across domains. In artificial intelligence, as systems increasingly operate in open, uncertain, and multimodal environments, reasoning becomes essential for enabling robust and adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a promising paradigm, integrating modalities such as text, images, audio, and video to support complex reasoning capabilities and aiming to achieve comprehensive perception, precise understanding, and deep reasoning. As research advances, multimodal reasoning has rapidly evolved from modular, perception-driven pipelines to unified, language-centric frameworks that offer more coherent cross-modal understanding. While instruction tuning and reinforcement learning have improved model reasoning, significant challenges remain in omni-modal generalization, reasoning depth, and agentic behavior. To address these issues, we present a comprehensive and structured survey of multimodal reasoning research, organized around a four-stage developmental roadmap that reflects the field's shifting design philosophies and emerging capabilities. First, we review early efforts based on task-specific modules, where reasoning was implicitly embedded across stages of representation, alignment, and fusion. Next, we examine recent approaches that unify reasoning into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT) and multimodal reinforcement learning enabling richer and more structured reasoning chains. Finally, drawing on empirical insights from challenging benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the conceptual direction of native large multimodal reasoning models (N-LMRMs), which aim to support scalable, agentic, and adaptive reasoning and planning in complex, real-world environments.
中文: 推理作为智能的核心,在开放多模态环境中对人工智能至关重要;大型多模态推理模型从模块化流程发展为统一框架,以应对泛化与自主行为等挑战,本文通过发展路线图对此进行了系统综述。
English: Reasoning is central to intelligence and crucial for robust AI in open, multimodal environments, with Large Multimodal Reasoning Models evolving from modular pipelines to unified frameworks to address challenges like generalization and agentic behavior, as surveyed through a developmental roadmap.

Authors:Hicham Assoudi
Title: A Comparative Benchmark of a Moroccan Darija Toxicity Detection Model (Typica.ai) and Major LLM-Based Moderation APIs (OpenAI, Mistral, Anthropic)
Abstract:
This paper presents a comparative benchmark evaluating the performance of Typica.ai's custom Moroccan Darija toxicity detection model against major LLM-based moderation APIs: OpenAI (omni-moderation-latest), Mistral (mistral-moderation-latest), and Anthropic Claude (claude-3-haiku-20240307). We focus on culturally grounded toxic content, including implicit insults, sarcasm, and culturally specific aggression often overlooked by general-purpose systems. Using a balanced test set derived from the OMCD_Typica.ai_Mix dataset, we report precision, recall, F1-score, and accuracy, offering insights into challenges and opportunities for moderation in underrepresented languages. Our results highlight Typica.ai's superior performance, underlining the importance of culturally adapted models for reliable content moderation.
中文摘要:本研究对Typica.ai的摩洛哥方言毒性检测模型与主流LLM审核API进行对比评估,结果表明该模型在识别文化特异性有害内容方面表现更优,凸显了文化适配模型的重要性。
English Summary: This study benchmarks Typica.ai's Moroccan Darija toxicity detection model against leading LLM moderation APIs, demonstrating its superior performance in identifying culturally specific toxic content through comprehensive metrics.

Authors:Kai Ruan, Mowen Huang, Ji-Rong Wen, Hao Sun
Title: Benchmarking LLMs' Swarm intelligence
Abstract:
Large Language Models (LLMs) show potential for complex reasoning, yet their capacity for emergent coordination in Multi-Agent Systems (MAS) when operating under strict swarm-like constraints -- limited local perception and communication -- remains largely unexplored. Existing benchmarks often do not fully capture the unique challenges of decentralized coordination when agents operate with incomplete spatio-temporal information. To bridge this gap, we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks (Pursuit, Synchronization, Foraging, Flocking, Transport) within a configurable 2D grid environment, forcing agents to rely solely on local sensory input (a k×k view) and local communication. We propose metrics for coordination effectiveness and analyze emergent group dynamics. Zero-shot evaluations of leading LLMs (e.g., deepseek-v3, o4-mini) reveal significant task-dependent performance variations. While some rudimentary coordination is observed, our results indicate that current LLMs significantly struggle with robust long-range planning and adaptive strategy formation under the uncertainty inherent in these decentralized scenarios. Assessing LLMs under such swarm-like constraints is crucial for understanding their utility in future decentralized intelligent systems. We release SwarmBench as an open, extensible toolkit -- built on a customizable physical system -- providing environments, prompts, evaluation scripts, and comprehensive datasets. This aims to foster reproducible research into LLM-based MAS coordination and the theoretical underpinnings of emergent collective behavior under severe informational decentralization. Our code repository is available at https://github.com/x66ccff/swarmbench.
中文: 大型语言模型在严格群体约束下的分散式多智能体系统中表现出有限的涌现协调能力,新型SwarmBench基准测试揭示了其在长程规划和适应性策略形成方面的显著不足。
English: Large Language Models exhibit limited emergent coordination in decentralized multi-agent systems under strict swarm constraints, as revealed by the novel SwarmBench benchmark which highlights their struggles with long-range planning and adaptive strategies.
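The central constraint, a k×k egocentric view, can be illustrated in a few lines of NumPy; this shows only the observation model, not the benchmark's tasks or physics.

```python
# Minimal sketch of a SwarmBench-style local observation: an agent sees only
# a k x k window of the global grid around itself (padding marks out-of-bounds).
import numpy as np

def local_view(grid, pos, k=5, pad_value=-1):
    """Return the k x k egocentric observation for an agent at `pos`."""
    r = k // 2
    padded = np.pad(grid, r, constant_values=pad_value)
    y, x = pos[0] + r, pos[1] + r         # shift into padded coordinates
    return padded[y - r:y + r + 1, x - r:x + r + 1]

grid = np.zeros((10, 10), dtype=int)
grid[2, 3] = 7                            # some other agent or object
print(local_view(grid, (0, 0)))           # corner agent: padding is visible
```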

Authors:Trinh T. L. Vuong, Jin Tae Kwak
Title: VideoPath-LLaVA: Pathology Diagnostic Reasoning Through Video Instruction Tuning
Abstract:
We present VideoPath-LLaVA, the first large multimodal model (LMM) in computational pathology that integrates three distinct image scenarios, single patch images, automatically keyframe-extracted clips, and manually segmented video pathology images, to mimic the natural diagnostic process of pathologists. By generating detailed histological descriptions and culminating in a definitive sign-out diagnosis, VideoPath-LLaVA bridges visual narratives with diagnostic reasoning. Central to our approach is the VideoPath-Instruct dataset, comprising 4278 video and diagnosis-specific chain-of-thought instructional pairs sourced from educational histopathology videos on YouTube. Although high-quality data is critical for enhancing diagnostic reasoning, its creation is time-intensive and limited in volume. To overcome this challenge, we transfer knowledge from existing single-image instruction datasets to train on weakly annotated, keyframe-extracted clips, followed by fine-tuning on manually segmented videos. VideoPath-LLaVA establishes a new benchmark in pathology video analysis and offers a promising foundation for future AI systems that support clinical decision-making through integrated visual and diagnostic reasoning. Our code, data, and model are publicly available at https://github.com/trinhvg/VideoPath-LLaVA.
中文:VideoPath-LLaVA是计算病理学中首个集成多种图像场景的大型多模态模型,通过模拟病理学家的自然诊断过程生成详细组织学描述和最终诊断,并借助创新的数据处理和训练方法设立了病理视频分析新基准。
English: VideoPath-LLaVA is the first large multimodal model in computational pathology that integrates multiple image scenarios to mimic pathologists' diagnostic process, generating detailed descriptions and definitive diagnoses while setting a new benchmark through innovative data handling and training methods.

Authors:Xu Huang, Yuefeng Huang, Weiwen Liu, Xingshan Zeng, Yasheng Wang, Ruiming Tang, Hong Xie, Defu Lian
Title: Advancing and Benchmarking Personalized Tool Invocation for LLMs
Abstract:
Tool invocation is a crucial mechanism for extending the capabilities of Large Language Models (LLMs) and has recently garnered significant attention. It enables LLMs to solve complex problems through tool calls while accessing up-to-date world knowledge. However, existing work primarily focuses on the fundamental ability of LLMs to invoke tools for problem-solving, without considering personalized constraints in tool invocation. In this work, we introduce the concept of Personalized Tool Invocation and define two key tasks: Tool Preference and Profile-dependent Query. Tool Preference addresses user preferences when selecting among functionally similar tools, while Profile-dependent Query considers cases where a user query lacks certain tool parameters, requiring the model to infer them from the user profile. To tackle these challenges, we propose PTool, a data synthesis framework designed for personalized tool invocation. Additionally, we construct \textbf{PTBench}, the first benchmark for evaluating personalized tool invocation. We then fine-tune various open-source models, demonstrating the effectiveness of our framework and providing valuable insights. Our benchmark is public at https://github.com/hyfshadow/PTBench.
中文: 本文提出个性化工具调用概念,通过PTool框架和PTBench基准测试解决工具选择偏好和用户档案相关查询问题,有效提升了大型语言模型的实际应用能力。
English: This paper introduces Personalized Tool Invocation, addressing user-specific constraints in tool selection and parameter inference through the proposed PTool framework and PTBench benchmark, demonstrating their effectiveness in enhancing LLM capabilities.

Authors:Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D. Manning, Peter Henderson, Daniel E. Ho
Title: A Reasoning-Focused Legal Retrieval Benchmark
Abstract:
As the legal community increasingly examines the use of large language models (LLMs) for various legal applications, legal AI developers have turned to retrieval-augmented LLMs ("RAG" systems) to improve system performance and robustness. An obstacle to the development of specialized RAG systems is the lack of realistic legal RAG benchmarks which capture the complexity of both legal retrieval and downstream legal question-answering. To address this, we introduce two novel legal RAG benchmarks: Bar Exam QA and Housing Statute QA. Our tasks correspond to real-world legal research tasks, and were produced through annotation processes which resemble legal research. We describe the construction of these benchmarks and the performance of existing retriever pipelines. Our results suggest that legal RAG remains a challenging application, thus motivating future research.

Authors:Gerrit Großmann, Larisa Ivanova, Sai Leela Poduru, Mohaddeseh Tabrizian, Islam Mesabah, David A. Selby, Sebastian J. Vollmer
Title: The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete
Abstract:
According to Yuval Noah Harari, large-scale human cooperation is driven by shared narratives that encode common beliefs and values. This study explores whether such narratives can similarly nudge LLM agents toward collaboration. We use a finitely repeated public goods game in which LLM agents choose either cooperative or egoistic spending strategies. We prime agents with stories highlighting teamwork to different degrees and test how this influences negotiation outcomes. Our experiments explore four questions:(1) How do narratives influence negotiation behavior? (2) What differs when agents share the same story versus different ones? (3) What happens when the agent numbers grow? (4) Are agents resilient against self-serving negotiators? We find that story-based priming significantly affects negotiation strategies and success rates. Common stories improve collaboration, benefiting each agent. By contrast, priming agents with different stories reverses this effect, and those agents primed toward self-interest prevail. We hypothesize that these results carry implications for multi-agent system design and AI alignment.
中文摘要:研究表明,通过共享叙事引导LLM智能体可显著提升其在谈判中的协作表现,而相异的故事设定则会削弱合作效果,使利己策略占据上风。
English Summary: This study demonstrates that priming LLM agents with shared narratives significantly enhances their collaborative behavior in negotiations, whereas conflicting narratives undermine cooperation and favor self-interested strategies.

Authors:Md Fahim Anjum
Title: When Reasoning Beats Scale: A 1.5B Reasoning Model Outranks 13B LLMs as Discriminator
Abstract:
Large Language Models (LLMs) with reasoning capabilities offer a promising path for improving candidate evaluation in planning frameworks, but their relative performance against traditional non-reasoning models remains largely underexplored. In this study, we benchmark a distilled 1.5B-parameter reasoning model (DeepSeek-R1) against several state-of-the-art non-reasoning LLMs within a generator-discriminator LLM planning framework for the text-to-SQL task. For this, we introduce a novel method for extracting soft scores from the chain-of-thought (CoT) outputs of reasoning models, which enables fine-grained ranking of candidates. Our central hypothesis is that reasoning models are more effective discriminators than non-reasoning LLMs. Our results show that the distilled DeepSeek-R1-1.5B achieves up to 87% higher F1 and 3.7% better discrimination accuracy than CodeLlama-7B, as well as 3.7% higher execution accuracy than CodeLlama-13B, despite having significantly fewer parameters. Furthermore, we find that there is a limit to the logical capabilities of reasoning models, and simply providing more context or allowing more compute budget for reasoning is not enough to improve their discrimination performance. Finally, we demonstrate that, unlike non-reasoning LLMs, reasoning models find generation more challenging than discrimination and may underperform as generators compared to smaller non-reasoning LLMs. Our work highlights the potential of reasoning models as discriminators in agentic frameworks, far outweighing their capabilities as generators, and offers insights into their optimal role within LLM planning infrastructures.
中文: 推理模型在规划框架中作为判别器表现卓越,其文本转SQL能力远超参数更多的非推理模型,但在生成任务中存在局限,揭示了其在智能体框架中的最优角色定位。
English: Reasoning models like DeepSeek-R1 demonstrate superior discrimination capabilities in LLM planning frameworks, significantly outperforming larger non-reasoning models in text-to-SQL tasks despite having fewer parameters, while revealing limitations in generation tasks.
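The exact soft-score recipe is the paper's contribution; as an assumption-level stand-in, one can average the log-probabilities of the tokens spelling the discriminator's verdict and use the result to rank candidates.

```python
# Hypothetical soft-score stand-in: mean-probability confidence over the
# tokens of the model's verdict span.
import math

def soft_score(verdict_token_logprobs):
    if not verdict_token_logprobs:
        return 0.0
    return math.exp(sum(verdict_token_logprobs) / len(verdict_token_logprobs))

# usage: sorted(candidates, key=lambda c: soft_score(logprobs[c]), reverse=True)
```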

Authors:Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun
Title: VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Abstract:
With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3-5x at the 7B parameter scale, but also significantly outperforms open-source models of similar size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
中文: 针对流式场景中首个音频令牌生成延迟高的问题,VITA-Audio采用轻量级多模态令牌预测模块和渐进式训练策略,实现了3~5倍的推理加速,并在语音识别、文本转语音和口语问答任务中表现卓越。
English: To overcome the high latency in generating the first audio token during streaming, VITA-Audio introduces a lightweight MCTP module and a progressive training strategy, achieving a 3~5x inference speedup and superior performance in ASR, TTS, and SQA tasks.
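The gist of multi-token prediction can be sketched with a handful of extra output heads that map one hidden state to several future audio tokens, so a single forward pass emits a chunk. Sizes and head design below are illustrative, not VITA-Audio's actual MCTP module.

```python
# Toy multi-token prediction head: one hidden state -> logits for n future
# audio tokens in a single forward pass.
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, d_model, vocab, n_future=4):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab)
                                   for _ in range(n_future))

    def forward(self, h):                  # h: [batch, d_model]
        return torch.stack([head(h) for head in self.heads], dim=1)

h = torch.randn(2, 512)
logits = MultiTokenHead(512, 4096)(h)      # [2, 4, 4096]: 4 audio tokens/pass
```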

Authors:Sharvi Endait, Ruturaj Ghatage, Aditya Kulkarni, Rajlaxmi Patil, Raviraj Joshi
Title: IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages
Abstract:
The rapid progress in question-answering (QA) systems has predominantly benefited high-resource languages, leaving Indic languages largely underrepresented despite their vast native speaker base. In this paper, we present IndicSQuAD, a comprehensive multi-lingual extractive QA dataset covering nine major Indic languages, systematically derived from the SQuAD dataset. Building on previous work with MahaSQuAD for Marathi, our approach adapts and extends translation techniques to maintain high linguistic fidelity and accurate answer-span alignment across diverse languages. IndicSQuAD comprises extensive training, validation, and test sets for each language, providing a robust foundation for model development. We evaluate baseline performances using language-specific monolingual BERT models and the multilingual MuRIL-BERT. The results indicate some challenges inherent in low-resource settings. Moreover, our experiments suggest potential directions for future work, including expanding to additional languages, developing domain-specific datasets, and incorporating multimodal data. The dataset and models are publicly shared at https://github.com/l3cube-pune/indic-nlp
中文: 本文介绍了IndicSQuAD,一个基于SQuAD构建的涵盖九种印度语言的多语言问答数据集,旨在解决这些语言在问答系统中的代表性不足问题,并通过基线模型评估揭示了当前面临的挑战和未来研究方向。
English: The paper introduces IndicSQuAD, a multilingual QA dataset for nine Indic languages derived from SQuAD, addressing the underrepresentation of these languages in QA systems and evaluating baseline models to highlight challenges and future research directions.

Authors:Arthur Satouf, Gabriel Ben Zenou, Benjamin Piwowarski, Habiboulaye Amadou Boubacar, Pablo Piantanida
Title: Rational Retrieval Acts: Leveraging Pragmatic Reasoning to Improve Sparse Retrieval
Abstract:
Current sparse neural information retrieval (IR) methods, and to a lesser extent more traditional models such as BM25, do not take into account the document collection and the complex interplay between different term weights when representing a single document. In this paper, we show how Rational Speech Acts (RSA), a linguistics framework used to minimize the number of features to be communicated when identifying an object in a set, can be adapted to the IR case -- and in particular to the high number of potential features (here, tokens). RSA dynamically modulates token-document interactions by considering the influence of other documents in the dataset, better contrasting document representations. Experiments show that incorporating RSA consistently improves multiple sparse retrieval models and achieves state-of-the-art performance on out-of-domain datasets from the BEIR benchmark. Code is available at https://github.com/arthur-75/Rational-Retrieval-Acts
中文摘要:本文采用理性言语行为框架,通过基于整个文档集合动态调整词项与文档的交互作用来改进稀疏神经信息检索模型,从而在跨领域数据集上取得了性能提升和领先成果。
English Summary: The paper adapts the Rational Speech Acts framework to enhance sparse neural information retrieval by dynamically adjusting token-document interactions based on the entire document collection, leading to improved performance and state-of-the-art results on out-of-domain datasets.
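The RSA adaptation can be pictured as alternating normalizations of a document-token weight matrix: a literal-listener step contrasts each token's weight across the collection, and a speaker step renormalizes within each document. The sketch below is a simplified rendering of the idea, not the paper's exact model.

```python
# Simplified RSA reweighting of a documents x tokens weight matrix W.
import numpy as np

def rsa_reweight(W, alpha=1.0, iters=1):
    S = W.astype(float)
    for _ in range(iters):
        # literal listener: normalize each token's weight across documents
        L = S / (S.sum(axis=0, keepdims=True) + 1e-9)
        # pragmatic speaker: renormalize across tokens within each document
        S = L ** alpha
        S = S / (S.sum(axis=1, keepdims=True) + 1e-9)
    return S

W = np.array([[2.0, 1.0, 0.0],     # doc 0
              [2.0, 0.0, 1.0]])    # doc 1: the shared token gets down-weighted
print(np.round(rsa_reweight(W), 3))
```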

Authors:Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah
Title: RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
Abstract:
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models, which are also governed by the Qwen License Agreement. Models at https://huggingface.co/collections/recursal/radlads-6818ee69e99e729ba8a87102 Training Code at https://github.com/recursal/RADLADS-paper
中文:RADLADS协议能以极少的训练数据和成本将softmax注意力变换器高效转换为线性注意力解码器,在保持接近原模型质量的同时实现领先性能,并以开放许可证发布。
English: RADLADS enables efficient conversion of softmax attention transformers into linear attention decoders using minimal training tokens and cost, achieving near-original quality and state-of-the-art performance while being released under open licenses.

Authors:Franklin Zhang, Sonya Zhang, Alon Halevy
Title: Leveraging LLMs to Create Content Corpora for Niche Domains
Abstract:
Constructing specialized content corpora from vast, unstructured web sources for domain-specific applications poses substantial data curation challenges. In this paper, we introduce a streamlined approach for generating high-quality, domain-specific corpora by efficiently acquiring, filtering, structuring, and cleaning web-based data. We showcase how Large Language Models (LLMs) can be leveraged to address complex data curation at scale, and propose a strategical framework incorporating LLM-enhanced techniques for structured content extraction and semantic deduplication. We validate our approach in the behavior education domain through its integration into 30 Day Me, a habit formation application. Our data pipeline, named 30DayGen, enabled the extraction and synthesis of 3,531 unique 30-day challenges from over 15K webpages. A user survey reports a satisfaction score of 4.3 out of 5, with 91% of respondents indicating willingness to use the curated content for their habit-formation goals.
中文: 本文提出了一种利用大型语言模型从网络数据中高效构建高质量领域专用语料库的简化方法,通过在习惯养成应用中的验证,成功提取了数千条挑战内容并获得了用户的高度满意度。
English: This paper introduces a streamlined approach using Large Language Models to efficiently create high-quality, domain-specific corpora from web data, validated through a habit-formation application where it successfully extracted thousands of challenges and achieved high user satisfaction.

Authors:Bang Zhang, Ruotian Ma, Qingxuan Jiang, Peisong Wang, Jiaqi Chen, Zheng Xie, Xingyu Chen, Yue Wang, Fanghua Ye, Jian Li, Yifan Yang, Zhaopeng Tu, Xiaolong Li
Title: Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Abstract:
Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM's higher-order social cognition. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4x) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g., Arena). SAGE thus provides a principled, scalable and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
中文摘要:SAGE框架通过在多轮对话中模拟类人情感变化与内心活动,评估大语言模型的社会认知能力,揭示了前沿模型与早期基线之间的显著差距,这些差距是传统评测体系未能体现的。
English Summary: The SAGE framework evaluates large language models' social cognition by simulating human-like emotional responses and inner thoughts during conversations, revealing significant performance gaps between advanced and baseline models that traditional benchmarks miss.

Authors:Yi-Fan Zhang, Xingyu Lu, Xiao Hu, Chaoyou Fu, Bin Wen, Tianke Zhang, Changyi Liu, Kaiyu Jiang, Kaibing Chen, Kaiyu Tang, Haojie Ding, Jiankang Chen, Fan Yang, Zhang Zhang, Tingting Gao, Liang Wang
Title: R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning
Abstract:
Multimodal Reward Models (MRMs) play a crucial role in enhancing the performance of Multimodal Large Language Models (MLLMs). While recent advancements have primarily focused on improving the model structure and training data of MRMs, there has been limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these capabilities in MRMs. In this paper, we explore how Reinforcement Learning (RL) can be used to improve reward modeling. Specifically, we reformulate the reward modeling problem as a rule-based RL task. However, we observe that directly applying existing RL algorithms, such as Reinforce++, to reward modeling often leads to training instability or even collapse due to the inherent limitations of these algorithms. To address this issue, we propose the StableReinforce algorithm, which refines the training loss, advantage estimation strategy, and reward design of existing RL methods. These refinements result in more stable training dynamics and superior performance. To facilitate MRM training, we collect 200K preference data from diverse datasets. Our reward model, R1-Reward, trained using the StableReinforce algorithm on this dataset, significantly improves performance on multimodal reward modeling benchmarks. Compared to previous SOTA models, R1-Reward achieves an 8.4% improvement on the VL Reward-Bench and a 14.3% improvement on the Multimodal Reward Bench. Moreover, with more inference compute, R1-Reward's performance is further enhanced, highlighting the potential of RL algorithms in optimizing MRMs.
中文: 本文提出StableReinforce算法,通过优化训练损失、优势估计和奖励设计来稳定强化学习训练,在构建的多模态偏好数据集上训练的奖励模型在各项基准测试中取得了显著性能提升。
English: This paper introduces StableReinforce, a refined reinforcement learning algorithm that stabilizes training and enhances multimodal reward models, achieving significant performance improvements on benchmarks with collected preference data.

Authors:Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
Title: ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Abstract:
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation, which approximates the pruned blocks. The estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at https://github.com/mts-ai/ReplaceMe.
Chinese: ReplaceMe 是一种无需训练的深度剪枝方法,通过线性操作替换 transformer 模块,在无需重新训练的情况下实现高达 25% 的剪枝率,同时保持约 90% 的原始性能。
English: ReplaceMe is a training-free depth pruning method that replaces transformer blocks with a linear operation, achieving up to 25% pruning while maintaining approximately 90% of original performance without retraining.
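The core estimation step admits a very small sketch: record activations entering and leaving the blocks to be pruned, then fit the replacement map by least squares. ReplaceMe additionally folds the estimated matrix into neighboring weights, which is omitted here.

```python
# Sketch of fitting a linear stand-in for a span of pruned transformer blocks
# from calibration activations (illustrative, not the released library).
import torch

def fit_linear_replacement(acts_in, acts_out):
    """acts_in/acts_out: [n_tokens, d] activations entering/leaving the span.

    Solves min_T ||acts_in @ T - acts_out||_F by least squares.
    """
    T = torch.linalg.lstsq(acts_in, acts_out).solution
    return T                               # [d, d] linear replacement

# usage: y_hat = x @ T replaces running the pruned blocks on x
```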

Authors:Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux
Title: fastabx: A library for efficient computation of ABX discriminability
Abstract:
We introduce fastabx, a high-performance Python library for building ABX discrimination tasks. ABX is a measure of the separation between generic categories of interest. It has been used extensively to evaluate phonetic discriminability in self-supervised speech representations. However, its broader adoption has been limited by the absence of adequate tools. fastabx addresses this gap by providing a framework capable of constructing any type of ABX task while delivering the efficiency necessary for rapid development cycles, both in task creation and in calculating distances between representations. We believe that fastabx will serve as a valuable resource for the broader representation learning community, enabling researchers to systematically investigate what information can be directly extracted from learned representations across several domains beyond speech processing. The source code is available at https://github.com/bootphon/fastabx.
中文:Fastabx 是一个高性能的 Python 库,旨在高效构建和计算 ABX 判别任务,填补了在语音处理之外多个领域中评估学习表示类别区分能力的工具空白。
English: Fastabx is a high-performance Python library designed to efficiently build and compute ABX discrimination tasks, addressing the lack of tools for evaluating category separation in learned representations across various domains beyond speech processing.
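For reference, the ABX score itself fits in a few lines: over triplets (a, b, x) with a and x drawn from category A and b from category B, count how often x lies closer to a than to b. This is a naive O(n^3) illustration, not the fastabx API.

```python
# Reference ABX discriminability between two categories of representations.
import numpy as np

def abx_score(A, B, dist=lambda u, v: np.linalg.norm(u - v)):
    """A, B: arrays [n, d] of representations from the two categories."""
    wins, total = 0, 0
    for i, x in enumerate(A):
        for a in np.delete(A, i, axis=0):  # a != x, same category as x
            for b in B:
                wins += dist(x, a) < dist(x, b)
                total += 1
    return wins / total                    # 1.0 = perfectly discriminable

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(10, 8))
B = rng.normal(3.0, 1.0, size=(10, 8))
print(abx_score(A, B))                     # close to 1.0 for separated classes
```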

Authors:Xiaobao Wu
Title: Sailing by the Stars: A Survey on Reward Models and Learning Strategies for Learning from Rewards
Abstract:
Recent developments in Large Language Models (LLMs) have shifted from pre-training scaling to post-training and test-time scaling. Across these developments, a key unified paradigm has arisen: Learning from Rewards, where reward signals act as the guiding stars to steer LLM behavior. It has underpinned a wide range of prevalent techniques, such as reinforcement learning (RLHF, RLAIF, DPO, and GRPO), reward-guided decoding, and post-hoc correction. Crucially, this paradigm enables the transition from passive learning from static data to active learning from dynamic feedback. This endows LLMs with aligned preferences and deep reasoning capabilities for diverse tasks. In this survey, we present a comprehensive overview of learning from rewards, from the perspective of reward models and learning strategies across training, inference, and post-inference stages. We further discuss the benchmarks for reward models and the primary applications. Finally we highlight the challenges and future directions. We maintain a paper collection at https://github.com/bobxwu/learning-from-rewards-llm-papers.
Chinese Summary: 近期大语言模型的发展重点转向基于奖励的学习范式,通过强化学习、奖励引导解码等技术,利用奖励信号指导模型行为,实现从静态数据被动学习到动态反馈主动学习的转变,从而增强模型的对齐能力和深度推理能力。
English Summary: Recent advancements in large language models are increasingly centered on learning from rewards, a paradigm that uses reward signals to guide model behavior through techniques like reinforcement learning and reward-guided decoding, enabling active learning from dynamic feedback for improved alignment and reasoning.

Authors:Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng
Title: LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Abstract:
Real-time, intelligent, and natural speech interaction is an essential part of the next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction following benchmarks, surpassing previous state-of-the-art SpeechLMs like GLM-4-Voice, which was trained on millions of hours of speech data.
中文: 本文介绍了LLaMA-Omni 2系列语音语言模型,通过集成语音编码器和流式解码器实现高质量实时语音交互,在仅使用少量训练数据的情况下,其性能超越了基于海量语音数据训练的现有最优模型。
English: This paper presents LLaMA-Omni 2, a series of speech language models that enable high-quality real-time speech interaction by integrating a speech encoder and streaming decoder, achieving superior performance on benchmarks despite minimal training data.

Authors:Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
Title: Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL
Abstract:
Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at https://github.com/RLHFlow/GVM.
中文摘要:GVM-RAFT提出动态采样策略,通过监控接受率和梯度范数优化计算资源分配,在数学推理任务中相比传统RAFT实现2-4倍加速收敛与显著精度提升。
English Summary: GVM-RAFT introduces a dynamic sampling strategy that optimizes computational resource allocation by minimizing gradient variance, achieving 2-4x faster convergence and higher accuracy in mathematical reasoning tasks compared to standard RAFT.
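The core idea of dynamic sample allocation can be illustrated with a toy budget-splitting rule. The sketch below assigns each prompt a share of the sampling budget that grows with its gradient-norm estimate and its rejection odds; this proportional rule and all variable names are illustrative simplifications, not the paper's exact variance-minimizing formula.

```python
# Toy sketch of prompt-specific dynamic sample allocation in the spirit of
# GVM-RAFT. The scoring rule below is an illustrative simplification.
import math

def allocate_budget(accept_rates, grad_norms, total_budget, min_samples=1):
    """Assign more samples to prompts with low acceptance and large gradients."""
    scores = [
        g * math.sqrt((1.0 - p) / max(p, 1e-3))  # harder prompts score higher
        for p, g in zip(accept_rates, grad_norms)
    ]
    z = sum(scores)
    raw = [total_budget * s / z for s in scores]
    return [max(min_samples, round(r)) for r in raw]

accept_rates = [0.9, 0.5, 0.1]   # per-prompt acceptance rates from monitoring
grad_norms   = [1.0, 1.2, 2.0]   # per-prompt stochastic gradient norm estimates
print(allocate_budget(accept_rates, grad_norms, total_budget=64))
# -> [3, 10, 51]: the hard, high-gradient prompt receives most of the budget
```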

Authors:Xiusi Chen, Gaotang Li, Ziqi Wang, Bowen Jin, Cheng Qian, Yu Wang, Hongru Wang, Yu Zhang, Denghui Zhang, Tong Zhang, Hanghang Tong, Heng Ji
Title: RM-R1: Reward Modeling as Reasoning
Abstract:
Reward modeling is essential for aligning large language models with human preferences through reinforcement learning from human feedback. To provide accurate reward signals, a reward model (RM) should stimulate deep thinking and conduct interpretable reasoning before assigning a score or a judgment. Inspired by recent advances in long chain-of-thought on reasoning-intensive tasks, we hypothesize and validate that integrating reasoning capabilities into reward modeling significantly enhances RMs' interpretability and performance. To this end, we introduce a new class of generative reward models - Reasoning Reward Models (ReasRMs) - which formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. RM-R1 features a chain-of-rubrics (CoR) mechanism - self-generating sample-level chat rubrics or math/code solutions, and evaluating candidate responses against them. The training of RM-R1 consists of two key stages: (1) distillation of high-quality reasoning chains and (2) reinforcement learning with verifiable rewards. Empirically, our models achieve state-of-the-art performance across three reward model benchmarks on average, outperforming much larger open-weight models (e.g., INF-ORM-Llama3.1-70B) and proprietary ones (e.g., GPT-4o) by up to 4.9%. Beyond final performance, we perform thorough empirical analyses to understand the key ingredients of successful ReasRM training. To facilitate future research, we release six ReasRM models along with code and data at https://github.com/RM-R1-UIUC/RM-R1.
中文: 本文提出推理奖励模型(ReasRMs),通过引入推理能力和链式评分机制,结合两阶段训练流程,显著提升了奖励模型的性能和可解释性,在多个基准测试中达到最优水平,并以高达4.9%的优势超越更大规模的模型。
English: This paper introduces Reasoning Reward Models (ReasRMs), which enhance reward modeling by incorporating reasoning capabilities through a chain-of-rubrics mechanism and a two-stage training process, achieving state-of-the-art performance across benchmarks while outperforming larger models by up to 4.9%.

Authors:Henry Ndubuaku, Mouad Talhi
Title: Parameter-Efficient Transformer Embeddings
Abstract:
Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which token embedding vectors are first generated deterministically, directly from the token IDs using a Fourier expansion of their normalized values, followed by a lightweight multilayer perceptron (MLP) that captures higher-order interactions. We train standard transformers and our architecture on natural language inference tasks (SNLI and MNLI), and evaluate zero-shot performance on sentence textual similarity (STS-B). Our results demonstrate that the proposed method achieves competitive performance using significantly fewer parameters, trains faster, and operates effectively without the need for dropout. This proof-of-concept study highlights the potential for scalable, memory-efficient language models and motivates further large-scale experimentation based on our findings.
中文: 我们提出的方法用基于傅里叶变换的确定性标记生成和轻量级多层感知机替代传统嵌入层,以更少参数和更快训练实现同等性能,且无需使用dropout技术。
English: Our proposed method replaces traditional embedding layers with a deterministic Fourier-based token generation followed by a lightweight MLP, achieving competitive performance with fewer parameters and faster training while eliminating dropout requirements.
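A minimal PyTorch sketch of the described construction follows: token embeddings generated deterministically from a Fourier expansion of the normalized token ID, then passed through a lightweight MLP. The frequency schedule, MLP width, and activation are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch of deterministic Fourier token embeddings refined by a small MLP.
# Hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class FourierEmbedding(nn.Module):
    def __init__(self, vocab_size: int, n_freqs: int = 32, d_model: int = 128):
        super().__init__()
        self.vocab_size = vocab_size
        # Fixed (non-learned) frequencies 1..n_freqs; no per-token parameters.
        self.register_buffer("freqs", torch.arange(1, n_freqs + 1).float())
        self.mlp = nn.Sequential(  # captures higher-order interactions
            nn.Linear(2 * n_freqs, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        t = token_ids.float() / self.vocab_size        # normalize IDs to [0, 1)
        ang = 2 * torch.pi * t.unsqueeze(-1) * self.freqs
        feats = torch.cat([torch.sin(ang), torch.cos(ang)], dim=-1)
        return self.mlp(feats)                         # (batch, seq, d_model)

emb = FourierEmbedding(vocab_size=30522)
print(emb(torch.tensor([[101, 2054, 2003, 102]])).shape)  # torch.Size([1, 4, 128])
```

Note that the parameter count lives entirely in the MLP, independent of vocabulary size, which is the source of the claimed savings.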

Authors:Minzheng Wang, Yongbin Li, Haobo Wang, Xinghua Zhang, Nan Xu, Bingli Wu, Fei Huang, Haiyang Yu, Wenji Mao
Title: Adaptive Thinking via Mode Policy Optimization for Social Language Agents
Abstract:
Effective social intelligence simulation requires language agents to dynamically adjust reasoning depth, a capability notably absent in current studies. Existing methods either lack this kind of reasoning capability or enforce Long Chain-of-Thought reasoning uniformly across all scenarios, resulting in excessive token usage and inflexible social simulation. To address this, we propose an Adaptive Mode Learning (AML) framework in this paper, aiming to improve the adaptive thinking ability of language agents in dynamic social interactions. To this end, we first identify hierarchical thinking modes ranging from intuitive response to deep deliberation based on the cognitive control theory. We then develop the Adaptive Mode Policy Optimization (AMPO) algorithm to optimize the context-aware mode switching and reasoning. Our framework advances existing research in three key aspects: (1) Multi-granular thinking mode design, (2) Context-aware mode switching across social interaction, and (3) Token-efficient reasoning via depth-adaptive processing. Extensive experiments on social intelligence benchmarks verify that AML achieves 15.6% higher task performance than GPT-4o. Notably, our AMPO outperforms GRPO by 7.0% with 32.8% shorter reasoning chains, demonstrating the advantage of adaptive thinking mode selection and optimization mechanism in AMPO over GRPO's fixed-depth solution.
中文: 本文提出自适应模式学习(AML)框架及自适应模式策略优化(AMPO)算法,通过多粒度思维模式设计和情境感知的模式切换机制,显著提升了语言代理在社交互动中的动态推理能力,实验证明其以更短推理链获得优于现有方法的性能表现。
English: This paper introduces an Adaptive Mode Learning (AML) framework with an Adaptive Mode Policy Optimization (AMPO) algorithm to enhance language agents' dynamic reasoning depth in social interactions, achieving superior performance over existing methods through context-aware mode switching and token-efficient processing.

Authors:Zhong Guan, Likang Wu, Hongke Zhao, Ming He, Jianpin Fan
Title: Attention Mechanisms Perspective: Exploring LLM Processing of Graph-Structured Data
Abstract:
Attention mechanisms are critical to the success of large language models (LLMs), driving significant advancements in multiple fields. However, for graph-structured data, which requires emphasis on topological connections, they fall short compared to message-passing mechanisms on fixed links, such as those employed by Graph Neural Networks (GNNs). This raises a question: "Does attention fail for graphs in natural language settings?" Motivated by these observations, we embarked on an empirical study from the perspective of attention mechanisms to explore how LLMs process graph-structured data. The goal is to gain deeper insights into the attention behavior of LLMs over graph structures. We uncovered unique phenomena regarding how LLMs apply attention to graph-structured data and analyzed these findings to improve the modeling of such data by LLMs. The primary findings of our research are: 1) While LLMs can recognize graph data and capture text-node interactions, they struggle to model inter-node relationships within graph structures due to inherent architectural constraints. 2) The attention distribution of LLMs across graph nodes does not align with ideal structural patterns, indicating a failure to adapt to graph topology nuances. 3) Neither fully connected attention nor fixed connectivity is optimal; each has specific limitations in its application scenarios. Instead, intermediate-state attention windows improve LLM training performance and seamlessly transition to fully connected windows during inference. Source code: https://github.com/millioniron/LLM_exploration
中文: 大语言模型的注意力机制因架构限制难以处理图结构数据,无法有效建模节点间关系和适应拓扑特征,但中间态注意力窗口能提升训练效果并平滑过渡到推理阶段。
English: Attention mechanisms in LLMs struggle with graph-structured data due to architectural limitations, failing to model inter-node relationships and adapt to topological nuances, but intermediate attention windows offer improved training and inference performance.
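To make the third finding concrete, the toy sketch below builds an attention mask that sits between the two extremes the abstract contrasts: fixed graph connectivity and full attention. The banded-window construction is an assumption for illustration, not the paper's exact mechanism.

```python
# Toy sketch of an "intermediate-state attention window": each node token
# attends to its graph neighbors plus a local window of w surrounding tokens.
import torch

def intermediate_window_mask(adj: torch.Tensor, w: int) -> torch.Tensor:
    """adj: (n, n) boolean adjacency. Returns an (n, n) boolean attention mask."""
    n = adj.size(0)
    idx = torch.arange(n)
    window = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= w  # banded window
    return adj | window | torch.eye(n, dtype=torch.bool)       # links + window + self

adj = torch.zeros(6, 6, dtype=torch.bool)
adj[0, 1] = adj[1, 0] = adj[0, 5] = adj[5, 0] = True   # a small toy graph
print(intermediate_window_mask(adj, w=1).int())
# w=0 recovers roughly fixed connectivity; w >= n recovers full attention.
```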

Authors:Joy Lim Jia Yin, Daniel Zhang-Li, Jifan Yu, Haoxuan Li, Shangqing Tu, Yuanchun Wang, Zhiyuan Liu, Huiqin Liu, Lei Hou, Juanzi Li, Bin Xu
Title: LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Abstract:
Evaluating the quality of slide-based multimedia instruction is challenging. Existing methods like manual assessment, reference-based metrics, and large language model evaluators face limitations in scalability, context capture, or bias. In this paper, we introduce LecEval, an automated metric grounded in Mayer's Cognitive Theory of Multimedia Learning, to evaluate multimodal knowledge acquisition in slide-based learning. LecEval assesses effectiveness using four rubrics: Content Relevance (CR), Expressive Clarity (EC), Logical Structure (LS), and Audience Engagement (AE). We curate a large-scale dataset of over 2,000 slides from more than 50 online course videos, annotated with fine-grained human ratings across these rubrics. A model trained on this dataset demonstrates superior accuracy and adaptability compared to existing metrics, bridging the gap between automated and human assessments. We release our dataset and toolkits at https://github.com/JoylimJY/LecEval.
中文: LecEval基于梅耶多媒体学习认知理论,通过四个评估维度自动评价幻灯片教学效果,并在大规模标注数据集上展现出优于现有方法的准确性和适应性。
English: LecEval introduces an automated evaluation metric based on Mayer's Cognitive Theory to assess slide-based multimedia instruction through four rubrics, demonstrating superior accuracy and adaptability on a large-scale annotated dataset compared to existing methods.

Authors:Anthony Nguyen, Wenjun Lin
Title: Intra-Layer Recurrence in Transformers for Language Modeling
Abstract:
Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
中文摘要:层内循环(ILR)通过在单次前向传播中选择性地对单个层应用循环机制,实验表明优先对早期层进行迭代可获得最佳性能。
English Summary: Intra-Layer Recurrence (ILR) selectively applies recurrence to individual transformer layers within a single forward pass, with experiments showing optimal performance when prioritizing earlier layers for iteration.
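A minimal PyTorch sketch of Intra-Layer Recurrence follows: each layer is reused a chosen number of times within one forward pass, with earlier layers looping more, in line with the reported finding. The encoder-layer stand-in, iteration schedule, and model sizes are illustrative assumptions.

```python
# Sketch of Intra-Layer Recurrence (ILR): per-layer iteration counts applied
# within a single forward pass. The schedule below is an example, not prescribed.
import torch
import torch.nn as nn

class ILRTransformer(nn.Module):
    def __init__(self, d_model=64, nhead=4, loops=(3, 2, 1, 1)):
        super().__init__()
        self.loops = loops  # iterations per layer; earlier layers loop more
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in loops
        )

    def forward(self, x):
        for layer, n_iter in zip(self.layers, self.loops):
            for _ in range(n_iter):     # recurrence within a single layer
                x = layer(x)
        return x

model = ILRTransformer()
print(model(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```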

Authors:Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee
Title: A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
Abstract:
Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workloads such as chain-of-thought, complex reasoning, and agent services significantly increase the inference cost by invoking the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking. This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open-source inference engines and examine the performance and cost policies of commercial solutions. We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/sihyeong/Awesome-LLM-Inference-Engine
Chinese: 本文对25个大语言模型推理引擎进行全面评估,分析其性能、设计目标和生态成熟度,为研究人员和开发者选择及优化适用于多样化服务需求的系统提供实用指导。
English: This paper conducts a comprehensive evaluation of 25 LLM inference engines, assessing their performance, design goals, and ecosystem maturity to guide researchers and developers in selecting and optimizing these systems for diverse service requirements.

Authors:Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
Title: Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Abstract:
LLMs trained on massive datasets may inadvertently acquire sensitive information such as personal details and potentially harmful content. This risk is further heightened in multimodal LLMs as they integrate information from multiple modalities (image and text). Adversaries can exploit this knowledge through multimodal prompts to extract sensitive details. Evaluating how effectively MLLMs can forget such information (targeted unlearning) necessitates the creation of high-quality, well-annotated image-text pairs. While prior work on unlearning has focused on text, multimodal unlearning remains underexplored. To address this gap, we first introduce a multimodal unlearning benchmark, UnLOK-VQA (Unlearning Outside Knowledge VQA), as well as an attack-and-defense framework to evaluate methods for deleting specific multimodal knowledge from MLLMs. We extend a visual question-answering dataset using an automated pipeline that generates varying-proximity samples for testing generalization and specificity, followed by manual filtering for maintaining high quality. We then evaluate six defense objectives against seven attacks (four whitebox, three blackbox), including a novel whitebox method leveraging interpretability of hidden states. Our results show multimodal attacks outperform text- or image-only ones, and that the most effective defense removes answer information from internal model states. Additionally, larger models exhibit greater post-editing robustness, suggesting that scale enhances safety. UnLOK-VQA provides a rigorous benchmark for advancing unlearning in MLLMs.
中文: 大型多模态语言模型存在记忆敏感信息的风险,可通过多模态提示提取,为此开发了UnLOK-VQA基准来评估定向遗忘方法,结果显示多模态攻击更有效,而更大模型具有更强的安全鲁棒性。
English: Large multimodal language models risk memorizing sensitive data, which can be extracted via multimodal prompts, prompting the development of UnLOK-VQA as a benchmark to evaluate targeted unlearning methods and revealing that multimodal attacks are more effective while larger models offer better safety.

Authors:Carlo Siebenschuh, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Arham Khan, Khalid Hossain, Yadu Babuji, Nicholas Chia, Venkatram Vishwanath, Rick Stevens, Arvind Ramanathan, Ian Foster, Robert Underwood
Title: AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine
Abstract:
Language models for scientific tasks are trained on text from scientific publications, most distributed as PDFs that require parsing. PDF parsing approaches range from inexpensive heuristics (for simple documents) to computationally intensive ML-driven systems (for complex or degraded ones). The choice of the "best" parser for a particular document depends on its computational cost and the accuracy of its output. To address these issues, we introduce an Adaptive Parallel PDF Parsing and Resource Scaling Engine (AdaParse), a data-driven strategy for assigning an appropriate parser to each document. We enlist scientists to select preferred parser outputs and incorporate this information through direct preference optimization (DPO) into AdaParse, thereby aligning its selection process with human judgment. AdaParse then incorporates hardware requirements and predicted accuracy of each parser to orchestrate computational resources efficiently for large-scale parsing campaigns. We demonstrate that AdaParse, when compared to state-of-the-art parsers, improves throughput by 17x while still achieving comparable accuracy (0.2 percent better) on a benchmark set of 1000 scientific documents. AdaParse's combination of high accuracy and parallel scalability makes it feasible to parse large-scale scientific document corpora to support the development of high-quality, trillion-token-scale text datasets. The implementation is available at https://github.com/7shoe/AdaParse/
中文摘要:AdaParse是一种自适应PDF解析引擎,通过整合人类偏好和优化资源分配,为每份科学文献智能匹配最适合的解析器,在大规模解析任务中实现17倍吞吐量提升并保持相当的准确率。
English Summary: AdaParse is an adaptive engine that efficiently assigns the most suitable PDF parser to each scientific document by incorporating human preferences and optimizing resource allocation, achieving 17 times higher throughput with comparable accuracy for large-scale parsing.

Authors:Murtadha Ahmed, Wenbo, Liu yunfeng
Title: MateICL: Mitigating Attention Dispersion in Large-Scale In-Context Learning
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in In-Context Learning (ICL). However, the fixed positional length constraints of pre-trained models limit the number of demonstration examples. Recent efforts to extend context suffer from attention dispersion as the number of demonstrations increases. In this paper, we introduce Mitigating Attention Dispersion in large-scale ICL (MateICL), which enables LLMs to maintain effective self-attention as the context size grows. We first split the context into multiple windows, each filled to the model's context capacity, which are processed separately. Then, we introduce an additional layer to recalibrate the attention weights, prioritizing the query tokens as the number of demonstrations increases. Our empirical results show that MateICL can effectively leverage larger contexts to improve ICL performance. Compared to retrieval-based baselines, MateICL consistently achieves better performance without requiring an externally trained retrieval model. Despite recent advances in inference strategies (e.g., 32k token contexts), our results demonstrate that MateICL remains beneficial in computationally resource-constrained settings. The code is publicly available at https://github.com/amurtadha/MateICL.
中文: 本文提出MateICL方法,通过分割上下文窗口并重新校准注意力权重来缓解大规模上下文学习中的注意力分散问题,使大语言模型能有效利用更长上下文,无需外部检索模型即可超越基于检索的基线方法。
English: This paper introduces MateICL, a method that splits context into windows and recalibrates attention weights to mitigate attention dispersion in large-scale in-context learning, enabling LLMs to effectively utilize larger contexts and outperform retrieval-based approaches without external models.
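The first step of the method, splitting demonstrations into windows that each fit the model's context capacity, can be sketched as a simple greedy packing routine. The sketch below omits the attention-recalibration layer; the whitespace token counter and greedy strategy are assumptions for illustration.

```python
# Toy sketch of MateICL's window-splitting step: pack demonstration strings
# into windows of at most `capacity` tokens, to be encoded separately.
def split_into_windows(demos, capacity, count_tokens=lambda s: len(s.split())):
    """Greedily pack demonstration strings into windows of <= capacity tokens."""
    windows, current, used = [], [], 0
    for demo in demos:
        n = count_tokens(demo)
        if current and used + n > capacity:  # window full: start a new one
            windows.append(current)
            current, used = [], 0
        current.append(demo)
        used += n
    if current:
        windows.append(current)
    return windows

demos = [f"Q: question {i} A: answer {i}" for i in range(10)]
print([len(w) for w in split_into_windows(demos, capacity=30)])  # [5, 5]
```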

Authors:Quang P. M. Pham, Khoi T. N. Nguyen, Nhi H. Doan, Cuong A. Pham, Qinbo Sun, Weimin Qi, Kentaro Inui, Dezhen Song
Title: SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation
Abstract:
Efficient path planning in robotics, particularly within large-scale, complex environments, remains a significant hurdle. While Large Language Models (LLMs) offer strong reasoning capabilities, their high computational cost and limited adaptability hinder real-time deployment on edge devices. We present SmallPlan - a novel framework leveraging LLMs as teacher models to train lightweight Small Language Models (SLMs) for high-level path planning tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate across scene graphs that compactly represent full-scale 3D scenes. The SLMs are trained in a simulation-powered, interleaved manner with LLM-guided supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not only enables SLMs to successfully complete navigation tasks but also makes them aware of important factors like distance traveled, providing more efficient path planning. Through experiments, we demonstrate that the fine-tuned SLMs perform competitively with larger models like GPT-4o on sequential path planning, without suffering from hallucination and overfitting. SmallPlan is resource-efficient, making it well-suited for edge-device deployment and advancing practical autonomous robotics. Our source code is available here: https://github.com/quangpham2006/SmallPlan
中文: SmallPlan提出了一种创新框架,利用大型语言模型作为教师模型来训练轻量级小语言模型,实现机器人高效路径规划,在保持与大型模型竞争性能的同时具备资源效率,适用于边缘设备部署。
English: SmallPlan introduces a novel framework that uses Large Language Models as teachers to train lightweight Small Language Models for efficient path planning in robotics, enabling competitive performance with larger models while being resource-efficient for edge-device deployment.

Authors:Henry Peng Zou, Wei-Chieh Huang, Yaozu Wu, Yankai Chen, Chunyu Miao, Hoang Nguyen, Yue Zhou, Weizhi Zhang, Liancheng Fang, Langzhou He, Yangning Li, Dongyuan Li, Renhe Jiang, Xue Liu, Philip S. Yu
Title: LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey
Abstract:
Recent advances in large language models (LLMs) have sparked growing interest in building fully autonomous agents. However, fully autonomous LLM-based agents still face significant challenges, including limited reliability due to hallucinations, difficulty in handling complex tasks, and substantial safety and ethical risks, all of which limit their feasibility and trustworthiness in real-world applications. To overcome these limitations, LLM-based human-agent systems (LLM-HAS) incorporate human-provided information, feedback, or control into the agent system to enhance system performance, reliability and safety. These human-agent collaboration systems enable humans and LLM-based agents to collaborate effectively by leveraging their complementary strengths. This paper provides the first comprehensive and structured survey of LLM-HAS. It clarifies fundamental concepts, systematically presents core components shaping these systems, including environment & profiling, human feedback, interaction types, orchestration and communication, explores emerging applications, and discusses unique challenges and opportunities arising from human-AI collaboration. By consolidating current knowledge and offering a structured overview, we aim to foster further research and innovation in this rapidly evolving interdisciplinary field. Paper lists and resources are available at https://github.com/HenryPengZou/Awesome-Human-Agent-Collaboration-Interaction-Systems.
Chinese: 本文首次系统综述了基于大语言模型的人机协作系统,通过整合人类输入来提升系统性能、可靠性和安全性,同时应对自主智能体存在的幻觉与伦理风险等挑战。
English: This paper presents the first comprehensive survey of LLM-based human-agent systems (LLM-HAS), which integrate human input to enhance performance, reliability, and safety while addressing challenges like hallucinations and ethical risks in autonomous agents.

Authors:Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li
Title: T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT
Abstract:
Recent advancements in large language models have demonstrated how chain-of-thought (CoT) and reinforcement learning (RL) can improve performance. However, applying such reasoning strategies to the visual generation domain remains largely unexplored. In this paper, we present T2I-R1, a novel reasoning-enhanced text-to-image generation model, powered by RL with a bi-level CoT reasoning process. Specifically, we identify two levels of CoT that can be utilized to enhance different stages of generation: (1) the semantic-level CoT for high-level planning of the prompt and (2) the token-level CoT for low-level pixel processing during patch-by-patch generation. To better coordinate these two levels of CoT, we introduce BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes both generation CoTs within the same training step. By applying our reasoning strategies to the baseline model, Janus-Pro, we achieve superior performance with 13% improvement on T2I-CompBench and 19% improvement on the WISE benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available at: https://github.com/CaraJ7/T2I-R1
中文: 本文提出T2I-R1模型,通过结合双层思维链推理与强化学习,在语义规划和像素处理层面优化文本到图像生成,显著提升了生成性能并超越了现有先进模型。
English: This paper introduces T2I-R1, a reasoning-enhanced text-to-image model that integrates a bi-level chain-of-thought process with reinforcement learning to optimize both semantic planning and pixel processing, achieving significant performance improvements over baseline and state-of-the-art models.

Authors:Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan
Title: Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
Abstract:
Memory is a fundamental component of AI systems, underpinning large language model (LLM)-based agents. While prior surveys have focused on memory applications with LLMs (e.g., enabling personalized memory in conversational agents), they often overlook the atomic operations that underlie memory dynamics. In this survey, we first categorize memory representations into parametric and contextual forms, and then introduce six fundamental memory operations: Consolidation, Updating, Indexing, Forgetting, Retrieval, and Compression. We map these operations to the most relevant research topics across long-term, long-context, parametric modification, and multi-source memory. By reframing memory systems through the lens of atomic operations and representation types, this survey provides a structured and dynamic perspective on research, benchmark datasets, and tools related to memory in AI, clarifying the functional interplay in LLM-based agents while outlining promising directions for future research. The paper list, datasets, methods, and tools are available at https://github.com/Elvin-Yiming-Du/Survey_Memory_in_AI.
中文: 本调查通过将记忆表征分类为参数化与情境化形式并引入六种基本记忆操作,重构了人工智能记忆系统,为基于大语言模型的智能体研究提供了结构化视角并指明了未来方向。
English: This survey reframes AI memory systems by categorizing representations into parametric and contextual forms and introducing six fundamental memory operations, offering a structured perspective on research and future directions in LLM-based agents.

Authors:Wenkai Yang, Jingwen Chen, Yankai Lin, Ji-Rong Wen
Title: DeepCritic: Deliberate Critique with Large Language Models
Abstract:
As Large Language Models (LLMs) are rapidly evolving, providing accurate feedback and scalable oversight on their outputs becomes an urgent and critical problem. Leveraging LLMs as critique models to achieve automated supervision is a promising solution. In this work, we focus on studying and enhancing the math critique ability of LLMs. Current LLM critics provide critiques that are too shallow and superficial on each step, leading to low judgment accuracy and struggling to offer sufficient feedback for the LLM generator to correct mistakes. To tackle this issue, we propose a novel and effective two-stage framework to develop LLM critics that are capable of deliberately critiquing on each reasoning step of math solutions. In the first stage, we utilize Qwen2.5-72B-Instruct to generate 4.5K long-form critiques as seed data for supervised fine-tuning. Each seed critique consists of deliberate step-wise critiques that include multi-perspective verifications as well as in-depth critiques of initial critiques for each reasoning step. Then, we perform reinforcement learning on the fine-tuned model with either existing human-labeled data from PRM800K or our automatically annotated data obtained via Monte Carlo sampling-based correctness estimation, to further incentivize its critique ability. Our developed critique model built on Qwen2.5-7B-Instruct not only significantly outperforms existing LLM critics (including the same-sized DeepSeek-R1-distill models and GPT-4o) on various error identification benchmarks, but also more effectively helps the LLM generator refine erroneous steps through more detailed feedback.
Chinese: 本研究提出一个两阶段框架,通过先使用精细化分步评论进行监督微调,再结合强化学习,显著提升大语言模型的数学批判能力,最终模型在错误识别和纠错反馈方面均优于现有评论模型。
English: This study introduces a two-stage framework to enhance LLMs' math critique ability by first fine-tuning with deliberate step-wise critiques and then applying reinforcement learning, resulting in a model that outperforms existing critics and improves error correction.

Authors:Marco Braga, Pranav Kasela, Alessandro Raganato, Gabriella Pasi
Title: Investigating Task Arithmetic for Zero-Shot Information Retrieval
Abstract:
Large Language Models (LLMs) have shown impressive zero-shot performance across a variety of Natural Language Processing tasks, including document re-ranking. However, their effectiveness degrades on unseen tasks and domains, largely due to shifts in vocabulary and word distributions. In this paper, we investigate Task Arithmetic, a technique that combines the weights of LLMs pre-trained on different tasks or domains via simple mathematical operations, such as addition or subtraction, to adapt retrieval models without requiring additional fine-tuning. Our method is able to synthesize diverse tasks and domain knowledge into a single model, enabling effective zero-shot adaptation in different retrieval contexts. Extensive experiments on publicly available scientific, biomedical, and multilingual datasets show that our method improves state-of-the-art re-ranking performance by up to 18% in NDCG@10 and 15% in P@10. In addition to these empirical gains, our analysis provides insights into the strengths and limitations of Task Arithmetic as a practical strategy for zero-shot learning and model adaptation. We make our code publicly available at https://github.com/DetectiveMB/Task-Arithmetic-for-ZS-IR.
中文摘要:本文提出的任务算术方法通过简单数学运算整合不同任务和领域的预训练模型权重,使大语言模型无需额外微调即可实现有效的零样本文档重排,在多项测试中显著提升了检索性能。
English Summary: This paper introduces Task Arithmetic, a technique that adapts Large Language Models for zero-shot document re-ranking by combining pre-trained weights from different tasks and domains through simple mathematical operations, achieving significant performance improvements without additional fine-tuning.
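The weight-combination operation itself is simple enough to sketch directly: form "task vectors" as the difference between fine-tuned and base weights, then add scaled combinations of them to the base model, with no further fine-tuning. The state-dict representation and scaling coefficients below are illustrative assumptions.

```python
# Minimal sketch of task arithmetic: elementwise addition/subtraction of task
# vectors (fine-tuned minus base weights). Coefficients are assumptions.
import torch

def task_vector(base: dict, finetuned: dict) -> dict:
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base: dict, vectors: list[dict], alphas: list[float]) -> dict:
    merged = {k: v.clone() for k, v in base.items()}
    for vec, alpha in zip(vectors, alphas):
        for k in merged:
            merged[k] += alpha * vec[k]  # add (or, with alpha < 0, subtract)
    return merged

# Toy state dicts standing in for pre-trained and task-fine-tuned checkpoints.
base = {"w": torch.zeros(3)}
sci  = {"w": torch.tensor([1.0, 0.0, 0.0])}   # "scientific domain" checkpoint
bio  = {"w": torch.tensor([0.0, 1.0, 0.0])}   # "biomedical domain" checkpoint
merged = apply_task_vectors(
    base, [task_vector(base, sci), task_vector(base, bio)], [0.5, 0.5]
)
print(merged["w"])  # tensor([0.5000, 0.5000, 0.0000])
```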

Authors:Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu
Title: Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Abstract:
Failure attribution in LLM multi-agent systems, i.e., identifying the agent and step responsible for task failures, provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution
中文摘要:本文针对大语言模型多智能体系统提出了自动化故障归因方法,通过Who&When数据集验证发现现有方法效果有限(最佳代理识别准确率53.5%,步骤定位仅14.2%),凸显该领域研究任重道远。
English Summary: This paper introduces automated failure attribution for LLM multi-agent systems, proposing the Who&When dataset and benchmarking methods that reveal significant challenges with current models achieving only 14.2-53.5% accuracy.

Authors:Zheng Zhang, Jinyi Li, Yihuai Lan, Xiang Wang, Hao Wang
Title: An Empirical Study on Prompt Compression for Large Language Models
Abstract:
Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression methods for LLMs, aiming to reduce prompt length while maintaining LLM response quality. In this paper, we present a comprehensive analysis covering aspects such as generation performance, model hallucinations, efficacy in multimodal tasks, word omission analysis, and more. We evaluate these methods across 13 datasets, including news, scientific articles, commonsense QA, math QA, long-context QA, and VQA datasets. Our experiments reveal that prompt compression has a greater impact on LLM performance in long contexts compared to short ones. In the Longbench evaluation, moderate compression even enhances LLM performance. Our code and data are available at https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression.
中文: 本研究评估了六种大型语言模型的提示压缩方法,发现它们能在保持响应质量的同时有效降低计算成本,且适度压缩在长上下文任务中甚至能提升性能。
English: This study evaluates six prompt compression methods for large language models, finding that they effectively reduce computational costs while maintaining response quality, with moderate compression even improving performance in long-context tasks.

Authors:Vinit K. Chavan
Title: Manifold-Constrained Sentence Embeddings via Triplet Loss: Projecting Semantics onto Spheres, Tori, and Möbius Strips
Abstract:
Recent advances in representation learning have emphasized the role of embedding geometry in capturing semantic structure. Traditional sentence embeddings typically reside in unconstrained Euclidean spaces, which may limit their ability to reflect complex relationships in language. In this work, we introduce a novel framework that constrains sentence embeddings to lie on continuous manifolds, specifically the unit sphere, torus, and Möbius strip, using triplet loss as the core training objective. By enforcing differential geometric constraints on the output space, our approach encourages the learning of embeddings that are both discriminative and topologically structured. We evaluate our method on benchmark datasets (AG News and MBTI) and compare it to classical baselines including TF-IDF, Word2Vec, and unconstrained Keras-derived embeddings. Our results demonstrate that manifold-constrained embeddings, particularly those projected onto spheres and Möbius strips, significantly outperform traditional approaches in both clustering quality (Silhouette Score) and classification performance (Accuracy). These findings highlight the value of embedding in manifold space, where topological structure complements semantic separation, offering a new and mathematically grounded direction for geometric representation learning in NLP.
中文: 本研究提出了一种流形约束的句子嵌入框架,将嵌入投影到球面和莫比乌斯带等几何表面上,在聚类和分类任务中均展现出优于传统欧几里得方法的性能。
English: This study introduces a manifold-constrained sentence embedding framework that projects embeddings onto geometric surfaces like spheres and Möbius strips, demonstrating superior performance over traditional Euclidean methods in both clustering and classification tasks.
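For the sphere case, the manifold constraint amounts to L2-normalizing the encoder output before applying triplet loss, as in the minimal PyTorch sketch below. The bag-of-words stand-in encoder, margin, and dimensions are assumptions for illustration; the torus and Möbius projections would replace the normalization step with their own parameterizations.

```python
# Sketch of sphere-constrained sentence embeddings trained with triplet loss.
# The encoder is a toy stand-in, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphereEncoder(nn.Module):
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.proj = nn.EmbeddingBag(vocab_size, dim)  # toy text encoder

    def forward(self, token_ids):
        # L2 normalization projects every embedding onto the unit sphere.
        return F.normalize(self.proj(token_ids), dim=-1)

enc = SphereEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.2)
anchor   = torch.randint(0, 1000, (8, 12))  # token IDs: anchor sentences
positive = torch.randint(0, 1000, (8, 12))  # same-class sentences
negative = torch.randint(0, 1000, (8, 12))  # different-class sentences
loss = loss_fn(enc(anchor), enc(positive), enc(negative))
loss.backward()  # gradients flow through the normalization constraint
print(float(loss))
```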

Authors:Xiaoxi Li, Jiajie Jin, Guanting Dong, Hongjin Qian, Yutao Zhu, Yongkang Wu, Ji-Rong Wen, Zhicheng Dou
Title: WebThinker: Empowering Large Reasoning Models with Deep Research Capability
Abstract:
Large reasoning models (LRMs), such as OpenAI-o1 and DeepSeek-R1, demonstrate impressive long-horizon reasoning capabilities. However, their reliance on static internal knowledge limits their performance on complex, knowledge-intensive tasks and hinders their ability to produce comprehensive research reports requiring synthesis of diverse web information. To address this, we propose WebThinker, a deep research agent that empowers LRMs to autonomously search the web, navigate web pages, and draft research reports during the reasoning process. WebThinker integrates a Deep Web Explorer module, enabling LRMs to dynamically search, navigate, and extract information from the web when encountering knowledge gaps. It also employs an Autonomous Think-Search-and-Draft strategy, allowing the model to seamlessly interleave reasoning, information gathering, and report writing in real time. To further enhance research tool utilization, we introduce an RL-based training strategy via iterative online Direct Preference Optimization (DPO). Extensive experiments on complex reasoning benchmarks (GPQA, GAIA, WebWalkerQA, HLE) and scientific report generation tasks (Glaive) demonstrate that WebThinker significantly outperforms existing methods and strong proprietary systems. Our approach enhances LRM reliability and applicability in complex scenarios, paving the way for more capable and versatile deep research systems. The code is available at https://github.com/RUC-NLPIR/WebThinker.
中文: WebThinker是一种深度研究智能体,通过赋予大型推理模型自主网络搜索和实时报告撰写能力,在复杂推理基准测试中显著超越了现有方法。
English: WebThinker is a deep research agent that enhances large reasoning models by enabling autonomous web searching and real-time report drafting, significantly outperforming existing methods on complex reasoning benchmarks.

Authors:Junsheng Huang, Zhitao He, Yucheng Huang, Sandeep Polisetty, Qingyun Wang, Yi. R Fung
Title: MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
Abstract:
The hallucination of non-existent facts by LLMs is an important problem given their widespread adoption across various applications. Previous research addresses this problem by analyzing the internal parameterized knowledge boundaries to estimate confidence. However, these studies focus on the single-problem setting and have not explored the more challenging multi-problem setting, which requires accurately answering multiple questions simultaneously. We introduce a novel method for the multi-problem setting, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.
中文: 本文提出MAC-Tuning新方法,通过在指令微调中分离答案预测与置信度估计来解决多问题场景下的大语言模型幻觉问题,实验表明其平均精度比基线方法最高提升25%。
English: This paper introduces MAC-Tuning, a novel method that separates answer prediction and confidence estimation during fine-tuning to address LLM hallucination in multi-problem settings, achieving up to 25% higher average precision than baselines.

Authors:Marc Glocker, Peter Hönig, Matthias Hirschmanner, Markus Vincze
Title: LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics
Abstract:
We present an embodied robotic system with an LLM-driven agent-orchestration architecture for autonomous household object management. The system integrates memory-augmented task planning, enabling robots to execute high-level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by task-specific LLMs. By leveraging in-context learning, our system avoids the need for explicit model training. Retrieval-augmented generation (RAG) enables the system to retrieve context from past interactions, enhancing long-term object tracking. A combination of Grounded SAM and LLaMa3.2-Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and an improvement in memory recall due to RAG. Specifically, Qwen2.5 yields the best performance for specialized agents, while LLaMA3.1 excels in routing tasks. The source code is available at: https://github.com/marc1198/chat-hsr.
中文摘要:本研究提出了一种基于大语言模型驱动代理编排架构的具身机器人系统,通过集成记忆增强任务规划和三个专业代理,无需显式模型训练即可实现自主家居物品管理,在多种家庭场景中展现出高任务规划精度和增强的记忆召回能力。
English Summary: This study introduces an embodied robotic system using an LLM-driven agent-orchestration architecture for autonomous household object management, integrating memory-augmented task planning and specialized agents to achieve high task accuracy and improved memory recall without explicit model training.

Authors:Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, Li Shen
Title: Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization
Abstract:
Recently, long-thought reasoning models achieve strong performance on complex reasoning tasks, but often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method (Ada-R1) significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. Our code is coming soon at https://github.com/StarDewXXX/AdaR1
中文摘要:提出的Ada-R1框架通过混合模型集成和双层偏好训练自适应选择推理深度,在数学数据集上保持性能的同时将推理长度缩减超50%。
English Summary: The proposed Ada-R1 framework adaptively selects reasoning depth through hybrid model integration and bi-level training, cutting reasoning length by over 50% while maintaining performance across mathematical datasets.

Authors:Jiaming wang, Yunke Zhao, Peng Ding, Jun Kuang, Yibin Shen, Zhe Tang, Yilin Jin, ZongYu Wang, Xiaoyu Li, Xuezhi Cao, Xunliang Cai
Title: Meeseeks: A Feedback-Driven, Iterative Self-Correction Benchmark evaluating LLMs' Instruction Following Capability
Abstract:
The capability to precisely adhere to instructions is a cornerstone for Large Language Models (LLMs) to function as dependable agents in real-world scenarios. However, confronted with complex prompts, LLMs frequently encounter difficulties in fulfilling all specified requirements within a single response. Drawing inspiration from recent advancements in Chain-of-Thought (CoT) prompting and self-correction methodologies, we introduce Meeseeks (The name is inspired by Mr. Meeseeks from "Rick and Morty," a character renowned for efficiently accomplishing assigned tasks. See: https://en.wikipedia.org/wiki/Mr._Meeseeks), a fully automated iterative instruction-following benchmark equipped with an integrated feedback mechanism. Meeseeks identifies erroneous components in model responses and provides corresponding feedback accurately, thereby iteratively guiding the model toward self-correction. The dataset contains over 700 curated instances annotated with 32 distinct capability tags in Chinese and English. Extensive experimental results reveal that different state-of-the-art commercial and open-source LLMs exhibit vastly disparate performance, and even after 20 turns of iterative feedback-driven self-correction, nearly all models demonstrate suboptimal performance. We conducted comprehensive analysis from both macro and instance levels, uncovering numerous common issues prevalent in current state-of-the-art models, as well as several counterintuitive phenomena. We've open-sourced our work at https://github.com/ADoublLEN/Meeseeks.
中文:Meeseeks基准测试是一个自动化系统,能识别大语言模型响应中的错误并提供迭代反馈以引导自我修正,但即便经过多轮修正,大多数模型仍表现欠佳。
English: The Meeseeks benchmark is an automated system that identifies errors in LLM responses and provides iterative feedback to guide self-correction, yet most models still underperform even after multiple rounds.

Authors:Bing Wang, Ximing Li, Changchun Li, Bingrui Zhao, Bo Fu, Renchu Guan, Shengsheng Wang
Title: Robust Misinformation Detection by Visiting Potential Commonsense Conflict
Abstract:
The development of Internet technology has led to an increased prevalence of misinformation, causing severe negative effects across diverse domains. To mitigate this challenge, Misinformation Detection (MD), aiming to detect online misinformation automatically, emerges as a rapidly growing research topic in the community. In this paper, we propose a novel plug-and-play augmentation method for the MD task, namely Misinformation Detection with Potential Commonsense Conflict (MD-PCC). We take inspiration from prior studies indicating that fake articles are more likely to involve commonsense conflict. Accordingly, we construct commonsense expressions for articles, which express potential commonsense conflicts inferred from the difference between the extracted commonsense triplets and golden ones produced by the well-established commonsense reasoning tool COMET. These expressions are then specified for each article as augmentation. Any specific MD method can then be trained on these commonsense-augmented articles. Besides, we also collect a novel commonsense-oriented dataset named CoMis, in which all fake articles are caused by commonsense conflict. We integrate MD-PCC with various existing MD backbones and compare them across both 4 public benchmark datasets and CoMis. Empirical results demonstrate that MD-PCC can consistently outperform the existing MD baselines.
中文: 本文提出MD-PCC,一种用于虚假信息检测的即插即用增强方法,通过比较提取的常识三元组与推理得出的标准三元组来利用潜在常识冲突,在多个数据集上持续优于现有基线。
English: This paper introduces MD-PCC, a plug-and-play augmentation method for misinformation detection that leverages potential commonsense conflicts by comparing extracted and inferred commonsense triplets, consistently outperforming existing baselines across multiple datasets.

Authors:Haowen Hou, Zhiyi Huang, Kaifeng Tan, Rongchang Lu, Fei Richard Yu
Title: RWKV-X: A Linear Complexity Hybrid Language Model
Abstract:
In this paper, we introduce RWKV-X, a novel hybrid architecture that combines the efficiency of RWKV for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches that rely on full attention layers and retain quadratic complexity, RWKV-X achieves linear-time complexity in training and constant-time complexity in inference decoding. We demonstrate that RWKV-X, when continually pretrained on 64K-token sequences, achieves near-perfect accuracy on the 64K passkey retrieval benchmark. It consistently outperforms prior RWKV-7 models on long-context benchmarks, while maintaining strong performance on short-context tasks. These results highlight RWKV-X as a scalable and efficient backbone for general-purpose language modeling, capable of decoding sequences up to 1 million tokens with stable speed and memory usage. To facilitate further research and analysis, we have made the checkpoints and the associated code publicly accessible at: https://github.com/howard-hou/RWKV-X.
中文: RWKV-X是一种混合架构,将RWKV的短程建模效率与稀疏注意力机制相结合以捕捉长程上下文,在训练中实现线性时间复杂度和推理中恒定时间复杂度,在长上下文基准测试中超越先前模型,同时保持优异的短上下文性能。
English: RWKV-X is a hybrid architecture combining RWKV's efficiency for short-range modeling with sparse attention for long-range context, achieving linear-time training and constant-time inference while outperforming previous models on long-context benchmarks and maintaining strong short-context performance.

Authors:Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Title: SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding
Abstract:
With the rapid development of Multi-modal Large Language Models (MLLMs), an increasing number of benchmarks have been established to evaluate the video understanding capabilities of these models. However, these benchmarks focus on standalone videos and mainly assess "visual elements" like human actions and object states. In reality, contemporary videos often encompass complex and continuous narratives, typically presented as a series. To address this challenge, we propose SeriesBench, a benchmark consisting of 105 carefully curated narrative-driven series, covering 28 specialized tasks that require deep narrative understanding. Specifically, we first select a diverse set of drama series spanning various genres. Then, we introduce a novel long-span narrative annotation method, combined with a full-information transformation approach to convert manual annotations into diverse task formats. To further enhance model capacity for detailed analysis of plot structures and character relationships within series, we propose a novel narrative reasoning framework, PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still face significant challenges in understanding narrative-driven series, while PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our SeriesBench and PC-DCoT highlight the critical necessity of advancing model capabilities to understand narrative-driven series, guiding the future development of MLLMs. SeriesBench is publicly available at https://github.com/zackhxn/SeriesBench-CVPR2025.
中文: 本文提出了SeriesBench这一评估多模态大语言模型对叙事驱动视频系列理解能力的新基准,并开发了PC-DCoT推理框架,该框架能有效提升模型在分析复杂剧情结构和角色关系方面的性能表现。
English: This paper introduces SeriesBench, a novel benchmark for evaluating Multi-modal Large Language Models' understanding of narrative-driven video series, and proposes PC-DCoT, a reasoning framework that enhances models' performance in analyzing complex plot structures and character relationships.

Authors:Xuanzhao Dong, Wenhui Zhu, Hao Wang, Xiwen Chen, Peijie Qiu, Rui Yin, Yi Su, Yalin Wang
Title: Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA
Abstract:
Medical question answering (QA) is a reasoning-intensive task that remains challenging for large language models (LLMs) due to hallucinations and outdated domain knowledge. Retrieval-Augmented Generation (RAG) provides a promising post-training solution by leveraging external knowledge. However, existing medical RAG systems suffer from two key limitations: (1) a lack of modeling for human-like reasoning behaviors during information retrieval, and (2) reliance on suboptimal medical corpora, which often results in the retrieval of irrelevant or noisy snippets. To overcome these challenges, we propose Discuss-RAG, a plug-and-play module designed to enhance the medical QA RAG system through collaborative agent-based reasoning. Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content. Additionally, a decision-making agent evaluates the retrieved snippets before their final integration. Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG, especially significantly improving answer accuracy by up to 16.67% on BioASQ and 12.20% on PubMedQA. The code is available at: https://github.com/LLM-VLM-GSL/Discuss-RAG.
Chinese: Discuss-RAG通过引入基于智能体的协作推理机制来提升医学问答系统的检索相关性和答案准确性,在多个基准数据集上显著超越了现有方法的性能表现。
English: Discuss-RAG enhances medical QA systems by introducing collaborative agent-based reasoning to improve retrieval relevance and answer accuracy, achieving significant performance gains over existing methods.

Authors:Yu Zheng, Longyi Liu, Yuming Lin, Jie Feng, Guozhen Zhang, Depeng Jin, Yong Li
Title: UrbanPlanBench: A Comprehensive Urban Planning Benchmark for Evaluating Large Language Models
Abstract:
The advent of Large Language Models (LLMs) holds promise for revolutionizing various fields traditionally dominated by human expertise. Urban planning, a professional discipline that fundamentally shapes our daily surroundings, is one such field heavily relying on multifaceted domain knowledge and experience of human experts. The extent to which LLMs can assist human practitioners in urban planning remains largely unexplored. In this paper, we introduce a comprehensive benchmark, UrbanPlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, UrbanPlanText, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, while there exists significant room for improvement, particularly in tasks requiring domain-specific terminology and reasoning. By making our benchmark, dataset, and associated evaluation and fine-tuning toolsets publicly available at https://github.com/tsinghua-fib-lab/PlanBench, we aim to catalyze the integration of LLMs into practical urban planning, fostering a symbiotic collaboration between human expertise and machine intelligence.
中文: 本文提出UrbanPlanBench基准测试,揭示大语言模型在城市规划知识方面存在显著不足(尤其对法规理解薄弱),并发布UrbanPlanText微调数据集——虽能提升模型表现,但在专业术语与推理方面仍需大幅改进。
English: This paper introduces UrbanPlanBench, a benchmark that reveals large language models' significant limitations in urban planning knowledge, particularly in regulatory understanding, and presents UrbanPlanText—a fine-tuning dataset that improves model performance while highlighting ongoing challenges in domain-specific reasoning.

Authors:Tianqing Fang, Hongming Zhang, Zhisong Zhang, Kaixin Ma, Wenhao Yu, Haitao Mi, Dong Yu
Title: WebEvolver: Enhancing Web Agent Self-Improvement with Coevolving World Model
Abstract:
Agent self-improvement, where the backbone Large Language Model (LLM) of the agent is trained on trajectories sampled autonomously from its own policy, has emerged as a promising approach for enhancing performance. Recent advancements, particularly in web environments, face a critical limitation: their performance reaches a stagnation point during autonomous learning cycles, hindering further improvement. We argue that this stems from limited exploration of the web environment and insufficient exploitation of pre-trained web knowledge in LLMs. To improve self-improvement, we propose a novel framework that introduces a co-evolving World Model LLM. This world model predicts the next observation based on the current observation and action within the web environment. Leveraging LLMs' pretrained knowledge of abundant web content, the World Model serves dual roles: (1) as a virtual web server generating self-instructed training data to continuously refine the agent's policy, and (2) as an imagination engine during inference, enabling look-ahead simulation to guide action selection for the agent LLM. Experiments in real-world web environments (Mind2Web-Live, WebVoyager, and GAIA-web) show a 10% performance gain over existing self-evolving agents, demonstrating the efficacy and generalizability of our approach, without using any distillation from more powerful close-sourced models. Our work establishes the necessity of integrating world models into autonomous agent frameworks to unlock sustained adaptability. Code is available at https://github.com/Tencent/SelfEvolvingAgent
中文: 该框架引入协同进化的世界模型大语言模型,既作为虚拟网络服务器生成自指导训练数据,又在推理时充当想象引擎,通过突破探索限制并利用预训练知识,在网络环境中实现了10%的性能提升。
English: The proposed framework introduces a co-evolving World Model LLM that acts as both a virtual web server for generating self-instructed training data and an imagination engine during inference, achieving a 10% performance gain in web environments by overcoming exploration limitations and leveraging pre-trained knowledge.
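To make the imagination-engine role concrete, here is a minimal sketch of look-ahead action selection with a world model. The `world_model` and `value_fn` callables are hypothetical stand-ins; this illustrates the control flow only, not the authors' implementation.

```python
# World-model look-ahead: imagine each action's outcome, pick the best one.
from typing import Callable, List

def lookahead_select(
    observation: str,
    candidate_actions: List[str],
    world_model: Callable[[str, str], str],  # predicts the next observation
    value_fn: Callable[[str], float],        # scores an imagined observation
) -> str:
    """Pick the action whose imagined next observation scores highest."""
    best_action, best_value = None, float("-inf")
    for action in candidate_actions:
        imagined = world_model(observation, action)  # simulate one step
        value = value_fn(imagined)
        if value > best_value:
            best_action, best_value = action, value
    return best_action

# Toy usage with stand-in functions:
wm = lambda obs, act: f"{obs} -> after {act}"
vf = lambda obs: float(len(obs))  # placeholder heuristic
print(lookahead_select("homepage", ["click login", "search"], wm, vf))
```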

Authors:Yinghan Zhou, Juan Wen, Wanli Peng, Yiming Xue, Ziwei Zhang, Zhengxian Wu
Title: Kill two birds with one stone: generalized and robust AI-generated text detection via dynamic perturbations
Abstract:
The growing popularity of large language models has raised concerns regarding the potential to misuse AI-generated text (AIGT). It has become increasingly critical to establish an AIGT detection method with high generalization and robustness. However, existing methods focus on either generalization or robustness; a unified mechanism that addresses both challenges simultaneously remains underexplored. In this paper, we argue that robustness can be viewed as a specific form of domain shift, and we empirically reveal an intrinsic mechanism for model generalization in the AIGT detection task. We then propose a novel AIGT detection method (DP-Net) that introduces dynamic perturbations via reinforcement learning with an elaborately designed reward and action. Extensive experimental results show that the proposed DP-Net significantly outperforms state-of-the-art AIGT detection methods in generalization capacity across three cross-domain scenarios, while also achieving the best robustness under two text adversarial attacks. The code is publicly available at https://github.com/CAU-ISS-Lab/AIGT-Detection-Evade-Detection/tree/main/DP-Net.
Chinese: 本文提出了一种名为DP-Net的新型AI生成文本检测方法,通过强化学习引入动态扰动,在跨域场景中展现出卓越的泛化能力,并在对抗攻击下实现了最优的鲁棒性表现。
English: This paper introduces DP-Net, a novel AI-generated text detection method that employs dynamic perturbations via reinforcement learning, demonstrating superior generalization across domains and enhanced robustness against adversarial attacks compared to existing approaches.

Authors:Shangyu Li, Juyong Jiang, Tiancheng Zhao, Jiasi Shen
Title: OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification
Abstract:
We introduce OSVBench, a new benchmark for evaluating Large Language Models (LLMs) on generating complete specification code for operating system kernel verification tasks. The benchmark first casts the specification generation problem as a program synthesis problem within a confined scope of syntax and semantics by providing LLMs with the programming model. The LLMs must understand the provided verification assumption and the potential syntax and semantics space to search, and then generate the complete specification for the potentially buggy operating system code implementation, guided by a high-level functional description of the operating system. This benchmark is built upon a real-world operating system kernel, Hyperkernel, and consists of 245 complex specification generation tasks in total, each a long-context task of about 20k-30k tokens. Our comprehensive evaluation of 12 LLMs reveals their limited performance on specification generation tasks for operating system verification. Significant disparities in their performance on the benchmark highlight differences in their ability to handle long-context code generation tasks. The evaluation toolkit and benchmark are available at https://github.com/lishangyu-hkust/OSVBench.
中文摘要:OSVBench是一个基于Hyperkernel操作系统内核的新基准,包含245个复杂任务,用于评估大语言模型在生成操作系统内核验证规范代码方面的能力,结果显示当前模型在处理长上下文代码生成任务上表现有限。
English Summary: OSVBench is a new benchmark for evaluating LLMs in generating complete specification code for operating system kernel verification, built upon the Hyperkernel with 245 complex tasks, revealing current models' limited performance in long-context code generation.

Authors:Haitao Wu, Zongbo Han, Joey Tianyi Zhou, Huaxi Huang, Changqing Zhang
Title: Computational Reasoning of Large Language Models
Abstract:
With the rapid development and widespread application of Large Language Models (LLMs), multidimensional evaluation has become increasingly critical. However, current evaluations are often domain-specific and overly complex, limiting their effectiveness as cross-domain proxies for core capabilities. To address these limitations and enable a unified and simple evaluation framework, an ideal proxy task should target a basic capability that generalizes across tasks and is independent of domain-specific knowledge. The Turing machine provides a powerful theoretical lens by reducing complex processes to basic, domain-agnostic computational operations. This perspective offers a principled framework for evaluating basic computational abilities essential to a wide range of tasks. Motivated by this abstraction, we introduce \textbf{Turing Machine Bench} (TMBench), a benchmark designed to assess the ability of LLMs to \textbf{strictly follow rules} and \textbf{accurately manage internal states} over multi-step processes, an ability we refer to as \textbf{computational reasoning}. TMBench incorporates four key features: self-contained and knowledge-agnostic reasoning, a minimalistic multi-step structure, controllable difficulty, and a solid theoretical foundation based on the Turing machine. Empirical results demonstrate that TMBench serves as an effective proxy for evaluating computational reasoning on representative LLMs. It produces clear step-wise accuracy curves, revealing LLMs' ability to execute multi-step reasoning processes. By analyzing performance trends across TMBench and established reasoning benchmarks, we find strong correlations with real-world tasks, bridging real-task evaluation with basic ability assessment. These findings suggest that TMBench holds potential as a cross-domain dimension for evaluating reasoning in LLMs. Code and data are available at \href{https://github.com/HaitaoWuTJU/Turing-Machine-Bench}{Repo}.
中文: 该摘要介绍了基于图灵机原理的TMBench基准测试,通过评估大语言模型在多步骤过程中严格遵循规则和管理内部状态的能力来测试其计算推理水平,并显示出与实际任务的强相关性。
English: The abstract introduces TMBench, a benchmark based on Turing machine principles to evaluate LLMs' computational reasoning by testing their ability to strictly follow rules and manage internal states across multi-step processes, showing strong correlations with real-world tasks.
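As a concrete illustration of the rule-following, state-tracking items a Turing-machine-based benchmark can be built around, here is a toy step simulator. The transition table is invented for illustration and is not TMBench's actual data.

```python
# Toy Turing-machine simulator: apply (state, symbol) -> (symbol, move, state)
# rules and record every intermediate configuration.
def run_tm(tape, rules, state="q0", head=0, max_steps=20):
    tape = list(tape)
    trace = []
    for _ in range(max_steps):
        symbol = tape[head]
        if (state, symbol) not in rules:
            break  # halt when no rule applies
        new_symbol, move, state = rules[(state, symbol)]
        tape[head] = new_symbol
        head += 1 if move == "R" else -1
        trace.append(("".join(tape), state, head))
    return trace

# Toy rule set: flip bits while moving right, halt at '_'.
rules = {("q0", "0"): ("1", "R", "q0"), ("q0", "1"): ("0", "R", "q0")}
for step in run_tm("0110_", rules):
    print(step)  # an LLM would be scored on reproducing each intermediate state
```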

Authors:Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem
Title: Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
Abstract:
Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model's optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model's confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13\% and 10\% respectively. Implementation is available at: https://github.com/hammoudhasan/SubthoughtReasoner.
中文: 本研究通过分析大语言模型的中间推理步骤,质疑最终答案的可靠性,并发现聚合分段子思维的答案能显著提高不同模型和数据集上的准确性。
English: This study questions the reliability of final answers from Large Language Models by analyzing intermediate reasoning steps and finds that aggregating answers from segmented subthoughts significantly improves accuracy across various models and datasets.
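A minimal sketch of the subthought procedure under stated assumptions: split a reasoning trace on linguistic cue words, obtain an answer from a continuation of each prefix, and aggregate by mode. The cue list and the `generate` stand-in are hypothetical, not the paper's exact segmentation.

```python
# Subthought aggregation: answer from each trace prefix, majority vote wins.
import re
from collections import Counter
from typing import Callable, List

CUES = r"\b(Wait|Alternatively|So|Therefore|Hmm)\b"

def subthought_answers(trace: str, generate: Callable[[str], str]) -> List[str]:
    cuts = [m.start() for m in re.finditer(CUES, trace)] or [len(trace)]
    prefixes = [trace[:c] for c in cuts]      # one prefix per subthought
    return [generate(p) for p in prefixes]    # answer from each continuation

def mode_answer(answers: List[str]) -> str:
    return Counter(answers).most_common(1)[0][0]  # most frequent answer

# Toy usage: a fake generator that "answers" with the last digit it sees.
fake_llm = lambda p: (re.findall(r"\d", p) or ["?"])[-1]
trace = "Compute 3+4. So 7. Wait, check: 3+4 = 7. Therefore 7."
print(mode_answer(subthought_answers(trace, fake_llm)))
```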

Authors:Rulin Shao, Rui Qiao, Varsha Kishore, Niklas Muennighoff, Xi Victoria Lin, Daniela Rus, Bryan Kian Hsiang Low, Sewon Min, Wen-tau Yih, Pang Wei Koh, Luke Zettlemoyer
Title: ReasonIR: Training Retrievers for Reasoning Tasks
Abstract:
We present ReasonIR-8B, the first retriever specifically trained for general reasoning tasks. Existing retrievers have shown limited gains on reasoning tasks, in part because existing training datasets focus on short factual queries tied to documents that straightforwardly answer them. We develop a synthetic data generation pipeline that, for each document, creates a challenging and relevant query, along with a plausibly related but ultimately unhelpful hard negative. By training on a mixture of our synthetic data and existing public data, ReasonIR-8B achieves a new state-of-the-art of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker on BRIGHT, a widely-used reasoning-intensive information retrieval (IR) benchmark. When applied to RAG tasks, ReasonIR-8B improves MMLU and GPQA performance by 6.4% and 22.6% respectively, relative to the closed-book baseline, outperforming other retrievers and search engines. In addition, ReasonIR-8B uses test-time compute more effectively: on BRIGHT, its performance consistently increases with longer and more information-rich rewritten queries, and it continues to outperform other retrievers when combined with an LLM reranker. Our training recipe is general and can be easily extended to future LLMs; to this end, we open-source our code, data, and model.
中文:ReasonIR-8B是首个专为通用推理任务训练的检索器,通过创新的合成数据生成方法,在推理基准测试中取得最优性能,并显著提升了RAG任务的表现。
English: ReasonIR-8B is the first retriever specifically trained for general reasoning tasks, achieving state-of-the-art performance on reasoning benchmarks and significantly enhancing RAG task results through a novel synthetic data generation pipeline.
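The pipeline's core loop can be sketched as follows; the `llm` callable and the prompts are hypothetical stand-ins for the paper's actual generation setup.

```python
# Per-document synthetic triples: reasoning-intensive query + hard negative.
from typing import Callable, Dict, List

def build_training_triples(
    documents: List[str], llm: Callable[[str], str]
) -> List[Dict[str, str]]:
    triples = []
    for doc in documents:
        query = llm(
            "Write a challenging question that this passage helps answer, "
            "requiring multi-step reasoning rather than keyword match:\n" + doc
        )
        hard_negative = llm(
            "Write a passage that looks topically related to the question "
            f"but does NOT help answer it:\n{query}"
        )
        triples.append({"query": query, "positive": doc, "negative": hard_negative})
    return triples  # fed into contrastive retriever training

stub = lambda prompt: prompt[:40]  # placeholder LLM
print(build_training_triples(["Gravity bends light near massive objects."], stub)[0].keys())
```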

Authors:Yiping Wang, Qing Yang, Zhiyuan Zeng, Liliang Ren, Liyuan Liu, Baolin Peng, Hao Cheng, Xuehai He, Kuan Wang, Jianfeng Gao, Weizhu Chen, Shuohang Wang, Simon Shaolei Du, Yelong Shen
Title: Reinforcement Learning for Reasoning in Large Language Models with One Training Example
Abstract:
We show that reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the mathematical reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%, and improves the average performance across six common mathematical reasoning benchmarks from 17.6% to 35.7%. This result matches the performance obtained using the 1.2k DeepScaleR subset (MATH500: 73.6%, average: 35.9%), which includes the aforementioned example. Furthermore, RLVR with only two examples even slightly exceeds these results (MATH500: 74.8%, average: 36.6%). Similar substantial improvements are observed across various models (Qwen2.5-Math-7B, Llama3.2-3B-Instruct, DeepSeek-R1-Distill-Qwen-1.5B), RL algorithms (GRPO and PPO), and different math examples (when employed as a single training example). In addition, we identify some interesting phenomena during 1-shot RLVR, including cross-domain generalization, increased frequency of self-reflection, and sustained test performance improvement even after the training accuracy has saturated, a phenomenon we term post-saturation generalization. Moreover, we verify that the effectiveness of 1-shot RLVR primarily arises from the policy gradient loss, distinguishing it from the "grokking" phenomenon. We also show the critical role of promoting exploration (e.g., by incorporating entropy loss with an appropriate coefficient) in 1-shot RLVR training. We also further discuss related observations about format correction, label robustness and prompt modification. These findings can inspire future work on RLVR efficiency and encourage a re-examination of recent progress and the underlying mechanisms in RLVR. Our code, model, and data are open source at https://github.com/ypwang61/One-Shot-RLVR.
Chinese: 单样本可验证奖励强化学习(RLVR)显著提升了大语言模型的数学推理能力,将MATH500基准测试准确率从36.0%提升至73.6%,并展现出跨领域泛化能力和训练饱和后的持续性能提升。
English: One-shot reinforcement learning with verifiable reward (RLVR) significantly enhances large language models' mathematical reasoning, boosting performance on benchmarks like MATH500 from 36.0% to 73.6% and demonstrating cross-domain generalization and post-saturation improvement.

Authors:Nishant Subramani, Jason Eisner, Justin Svegliato, Benjamin Van Durme, Yu Su, Sam Thomson
Title: MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools
Abstract:
Tool-using agents that act in the world need to be both useful and safe. Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions, but prior work shows that many models are poorly calibrated. Inspired by interpretability literature exploring the internals of models, we propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools. MICE first decodes from each intermediate layer of the language model using logitLens and then computes similarity scores between each layer's generation and the final output. These features are fed into a learned probabilistic classifier to assess confidence in the decoded output. On the simulated trial and error (STE) tool-calling dataset using Llama3 models, we find that MICE beats or matches the baselines on smoothed expected calibration error. Using MICE confidences to determine whether to call a tool significantly improves over strong baselines on a new metric, expected tool-calling utility. Further experiments show that MICE is sample-efficient, can generalize zero-shot to unseen APIs, and results in higher tool-calling utility in scenarios with varying risk levels. Our code is open source, available at https://github.com/microsoft/mice_for_cats.
中文摘要:提出的MICE方法通过逐层相似性分析和概率分类计算内部置信度,显著提升了工具使用代理的安全性和实用性,在不同风险场景下均优于基线模型的校准精度和工具调用效能。
English Summary: The proposed MICE method enhances tool-using agents' safety and utility by computing internal confidence scores through layer-wise similarity analysis and probabilistic classification, outperforming baselines in calibration and tool-calling effectiveness across diverse scenarios.
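A minimal numpy sketch of the feature extraction behind MICE: decode every intermediate layer through a shared unembedding (the logit lens) and record per-layer agreement with the final prediction. The shapes, random weights, and agreement feature are assumptions for illustration; the paper feeds such features into a learned probabilistic classifier.

```python
# Logit-lens features: how early (and how stably) does the final token emerge?
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, vocab = 6, 8, 20
unembed = rng.normal(size=(d_model, vocab))        # shared LM head
hiddens = rng.normal(size=(n_layers, d_model))     # one position, every layer

layer_tokens = (hiddens @ unembed).argmax(axis=-1) # logit-lens decode per layer
final_token = layer_tokens[-1]
agreement = (layer_tokens == final_token).astype(float)  # per-layer feature

# Early, stable agreement is treated as evidence of confidence; the feature
# vector would go to a probabilistic classifier (e.g. logistic regression)
# that outputs a calibrated tool-calling confidence.
print(layer_tokens, agreement.mean())
```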

Authors:Zae Myung Kim, Chanwoo Park, Vipul Raheja, Suin Kim, Dongyeop Kang
Title: Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
Abstract:
Reward-based alignment methods for large language models (LLMs) face two key limitations: vulnerability to reward hacking, where models exploit flaws in the reward signal; and reliance on brittle, labor-intensive prompt engineering when LLMs are used as reward models. We introduce Meta Policy Optimization (MPO), a framework that addresses these challenges by integrating a meta-reward model that dynamically refines the reward model's prompt throughout training. In MPO, the meta-reward model monitors the evolving training context and continuously adjusts the reward model's prompt to maintain high alignment, providing an adaptive reward signal that resists exploitation by the policy. This meta-learning approach promotes a more stable policy optimization, and greatly reduces the need for manual reward prompt design. It yields performance on par with or better than models guided by extensively hand-crafted reward prompts. Furthermore, we show that MPO maintains its effectiveness across diverse tasks, from essay writing to mathematical reasoning, without requiring specialized reward designs. Beyond standard RLAIF, MPO's meta-learning formulation is readily extensible to higher-level alignment frameworks. Overall, this method addresses theoretical and practical challenges in reward-based RL alignment for LLMs, paving the way for more robust and adaptable alignment strategies. The code and data can be accessed at: https://github.com/minnesotanlp/mpo
中文: 元策略优化(MPO)通过引入元奖励模型,在训练中动态优化奖励提示,有效应对奖励破解并减少对人工提示工程的依赖,同时在多样化任务中保持或超越手工设计提示的性能表现。
English: Meta Policy Optimization (MPO) introduces a meta-reward model that dynamically refines reward prompts during training, effectively combating reward hacking and reducing reliance on manual prompt engineering while maintaining or surpassing performance across diverse tasks.

Authors:Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
Title: RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Abstract:
Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interaction with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on four stylized environments reveals three core findings. First, agent RL training exhibits a recurring failure mode, Echo Trap, marked by reward-variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and gradient stabilization. Second, we find that shaping RL rollouts benefits from diverse initial states, medium interaction granularity, and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may exhibit shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
中文摘要:本研究提出了用于多轮强化学习训练大语言模型智能体的StarPO框架和RAGEN系统,揭示了训练中的回声陷阱等挑战,强调需要采用稳定化训练方法和多样化奖励信号来促进智能体的有效推理能力。
English Summary: The study introduces StarPO and RAGEN frameworks for training LLM agents through multi-turn reinforcement learning, revealing challenges like Echo Trap and emphasizing the need for stabilized training methods and diverse reward signals to foster effective agent reasoning.

Authors:Kyo Gerrits, Ana Guerberof-Arenas
Title: To MT or not to MT: An eye-tracking study on the reception by Dutch readers of different translation and creativity levels
Abstract:
This article presents the results of a pilot study involving the reception of a fictional short story translated from English into Dutch under four conditions: machine translation (MT), post-editing (PE), human translation (HT) and original source text (ST). The aim is to understand how creativity and errors in different translation modalities affect readers, specifically regarding cognitive load. Eight participants filled in a questionnaire, read a story using an eye-tracker, and conducted a retrospective think-aloud (RTA) interview. The results show that units of creative potential (UCP) increase cognitive load and that this effect is highest for HT and lowest for MT; no effect of error was observed. Triangulating the data with RTAs leads us to hypothesize that the higher cognitive load in UCPs is linked to increases in reader enjoyment and immersion. The effect of translation creativity on cognitive load in different translation modalities at word-level is novel and opens up new avenues for further research. All the code and data are available at https://github.com/INCREC/Pilot_to_MT_or_not_to_MT
中文摘要:该试点研究表明,翻译中的创意元素会提高认知负荷,人工翻译中最为显著,机器翻译中最低,而错误无此影响,且认知负荷增加可能提升读者的阅读乐趣和沉浸感。
English summary: This pilot study reveals that creative elements in translations increase cognitive load most in human translations and least in machine translations, while errors show no effect, with higher cognitive load potentially enhancing reader enjoyment and immersion.

Authors:Mengxia Yu, Bang Nguyen, Olivia Zino, Meng Jiang
Title: Context Selection and Rewriting for Video-based Educational Question Generation
Abstract:
Educational question generation (EQG) is a crucial component of intelligent educational systems, significantly aiding self-assessment, active learning, and personalized education. While EQG systems have emerged, existing datasets typically rely on predefined, carefully edited texts, failing to represent real-world classroom content, including lecture speech with a set of complementary slides. To bridge this gap, we collect a dataset of educational questions based on lectures from real-world classrooms. On this realistic dataset, we find that current methods for EQG struggle with accurately generating questions from educational videos, particularly in aligning with specific timestamps and target answers. Common challenges include selecting informative contexts from extensive transcripts and ensuring generated questions meaningfully incorporate the target answer. To address the challenges, we introduce a novel framework utilizing large language models for dynamically selecting and rewriting contexts based on target timestamps and answers. First, our framework selects contexts from both lecture transcripts and video keyframes based on answer relevance and temporal proximity. Then, we integrate the contexts selected from both modalities and rewrite them into answer-containing knowledge statements, to enhance the logical connection between the contexts and the desired answer. This approach significantly improves the quality and relevance of the generated questions. Our dataset and code are released in https://github.com/mengxiayu/COSER.
中文摘要:本研究提出了一种利用大型语言模型动态筛选和重写教育视频上下文的新框架,通过增强与时间戳和目标答案的匹配度,解决了生成准确相关教育问题的挑战。
English Summary: This study introduces a novel framework using large language models to dynamically select and rewrite contexts from educational videos, addressing challenges in generating accurate and relevant questions by improving alignment with timestamps and target answers.
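A minimal sketch of the selection step under stated assumptions: rank transcript segments by a weighted mix of answer relevance and temporal proximity to the target timestamp. The token-overlap relevance measure and the weights are illustrative, not the paper's actual scoring.

```python
# Context selection: answer relevance + temporal proximity, top-k segments.
from typing import List, Tuple

def score_segment(seg_text: str, seg_time: float,
                  answer: str, target_time: float,
                  w_rel: float = 0.7, w_time: float = 0.3) -> float:
    ans_tokens = set(answer.lower().split())
    seg_tokens = set(seg_text.lower().split())
    relevance = len(ans_tokens & seg_tokens) / max(len(ans_tokens), 1)
    proximity = 1.0 / (1.0 + abs(seg_time - target_time))  # decays with distance
    return w_rel * relevance + w_time * proximity

def select_contexts(segments: List[Tuple[str, float]], answer: str,
                    target_time: float, k: int = 2) -> List[str]:
    ranked = sorted(segments,
                    key=lambda s: score_segment(s[0], s[1], answer, target_time),
                    reverse=True)
    return [text for text, _ in ranked[:k]]  # handed off for rewriting

segs = [("gradient descent updates weights", 120.0),
        ("today we review the syllabus", 5.0)]
print(select_contexts(segs, "gradient descent", target_time=118.0))
```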

Authors:Peilin Zhou, Bruce Leon, Xiang Ying, Can Zhang, Yifan Shao, Qichen Ye, Dading Chong, Zhiling Jin, Chenxuan Xie, Meng Cao, Yuxin Gu, Sixin Hong, Jing Ren, Jian Chen, Chao Liu, Yining Hua
Title: BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese
Abstract:
As large language models (LLMs) evolve into tool-using agents, the ability to browse the web in real-time has become a critical yardstick for measuring their reasoning and retrieval competence. Existing benchmarks such as BrowseComp concentrate on English and overlook the linguistic, infrastructural, and censorship-related complexities of other major information ecosystems -- most notably Chinese. To address this gap, we introduce BrowseComp-ZH, a high-difficulty benchmark purpose-built to comprehensively evaluate LLM agents on the Chinese web. BrowseComp-ZH consists of 289 multi-hop questions spanning 11 diverse domains. Each question is reverse-engineered from a short, objective, and easily verifiable answer (e.g., a date, number, or proper noun). A two-stage quality control protocol is applied to strive for high question difficulty and answer uniqueness. We benchmark over 20 state-of-the-art language models and agentic search systems on our proposed BrowseComp-ZH. Despite their strong conversational and retrieval capabilities, most models struggle severely: a large number achieve accuracy rates below 10%, and only a handful exceed 20%. Even the best-performing system, OpenAI's DeepResearch, reaches just 42.9%. These results demonstrate the considerable difficulty of BrowseComp-ZH, where success demands not only effective retrieval strategies, but also sophisticated reasoning and information reconciliation -- capabilities that current models still struggle to master. Our dataset, construction guidelines, and benchmark results have been publicly released at https://github.com/PALIN2018/BrowseComp-ZH.
中文: 针对现有基准忽略中文网络复杂性的问题,BrowseComp-ZH作为高难度中文网页评估基准被提出,大多数模型在其测试中表现不佳,凸显了当前模型在检索与推理能力上的不足。
English: To address the lack of benchmarks for evaluating LLM agents on the Chinese web, BrowseComp-ZH is introduced as a high-difficulty, multi-domain dataset where most models perform poorly, highlighting the challenges in retrieval and reasoning.

Authors:Hanyu Lai, Junjie Gao, Xiao Liu, Yifan Xu, Shudan Zhang, Yuxiao Dong, Jie Tang
Title: AndroidGen: Building an Android Language Agent under Data Scarcity
Abstract:
Large language models have opened up a world of possibilities for various NLP tasks, sparking optimism for the future. Despite their potential, LLMs have yet to be widely used as agents on real mobile devices. The main challenge is the need for high-quality data sources. Time constraints and labor intensity often hinder human annotation. On the other hand, existing LLMs exhibit inadequate completion rates and need a robust data filtration strategy. Given these challenges, we develop a framework called AndroidGen to enhance the capabilities of LLM-based agents under data scarcity. In addition, we leverage AndroidGen to collect trajectories given human tasks and train open-source LLMs on these trajectories to develop an open-source mobile agent without manually labeled trajectories. We extensively evaluate AndroidGen with AndroidWorld, AitW, and various popular applications, demonstrating its improvements and revealing potential areas for future improvement. Code, model, and data are available at https://github.com/THUDM/AndroidGen.
中文: AndroidGen框架通过生成高质量数据轨迹来训练大型语言模型,解决了其在移动设备代理应用中数据稀缺和性能不足的问题,无需人工标注,并在多个基准测试中验证了其有效性。
English: The AndroidGen framework addresses the limitations of large language models in mobile agent applications by generating high-quality data trajectories for training, thereby improving performance without manual annotation, as validated across multiple benchmarks.

Authors:Dylan Bouchard, Mohit Singh Chauhan
Title: Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers
Abstract:
Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we outline a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we propose a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper's companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.
中文: 本文提出了一种无需外部资源的通用框架,通过将多种不确定性量化技术转化为标准化置信度分数,并采用可调整的集成方法,有效检测大语言模型的幻觉问题,其性能优于现有方法。
English: This paper introduces a versatile zero-resource framework for detecting hallucinations in Large Language Models by adapting uncertainty quantification techniques into standardized confidence scores and proposing a tunable ensemble approach that outperforms existing methods.
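The tunable ensemble amounts to a weighted combination of response-level scorers, each normalized to [0, 1]. The sketch below uses hypothetical stand-in scorers and hand-set weights; it does not reproduce the UQLM toolkit's API.

```python
# Weighted ensemble of confidence scorers; output stays in [0, 1].
from typing import Callable, Dict

def ensemble_confidence(response: str,
                        scorers: Dict[str, Callable[[str], float]],
                        weights: Dict[str, float]) -> float:
    total = sum(weights.values())
    return sum(weights[name] * scorers[name](response)
               for name in scorers) / total

scorers = {
    "black_box": lambda r: 0.8,   # e.g. self-consistency across samples
    "white_box": lambda r: 0.6,   # e.g. mean token probability
    "judge":     lambda r: 0.9,   # e.g. LLM-as-a-Judge verdict
}
weights = {"black_box": 0.5, "white_box": 0.2, "judge": 0.3}
print(ensemble_confidence("Paris is the capital of France.", scorers, weights))
```

The weights are the tunable part: on a labeled validation set, they could be fit to maximize calibration for a specific use case.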

Authors:Jianlong Chen, Chao Li, Yang Yuan, Andrew C Yao
Title: Hierarchical Attention Generates Better Proofs
Abstract:
Large language models (LLMs) have shown promise in formal theorem proving, but their token-level processing often fails to capture the inherent hierarchical nature of mathematical proofs. We introduce \textbf{Hierarchical Attention}, a regularization method that aligns LLMs' attention mechanisms with mathematical reasoning structures. Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation. Experiments demonstrate that our method improves proof success rates by 2.05\% on miniF2F and 1.69\% on ProofNet while reducing proof complexity by 23.81\% and 16.50\% respectively. The code is available at https://github.com/Car-pe/HAGBP.
中文: 分层注意力是一种正则化方法,通过将大语言模型的注意力机制与数学推理结构对齐,提高了定理证明的成功率并降低了证明复杂度。
English: Hierarchical Attention is a regularization method that aligns LLMs' attention with mathematical reasoning structures, improving proof success rates and reducing complexity in theorem proving.

Authors:Jiaqi Chen, Bang Zhang, Ruotian Ma, Peisong Wang, Xiaodan Liang, Zhaopeng Tu, Xiaolong Li, Kwan-Yee K. Wong
Title: SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
Abstract:
Evaluating the step-by-step reliability of large language model (LLM) reasoning, such as Chain-of-Thought, remains challenging due to the difficulty and cost of obtaining high-quality step-level supervision. In this paper, we introduce Self-Play Critic (SPC), a novel approach where a critic model evolves its ability to assess reasoning steps through adversarial self-play games, eliminating the need for manual step-level annotation. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" that deliberately produces erroneous steps designed to be difficult to detect, and a "critic" that analyzes the correctness of reasoning steps. These two models engage in an adversarial game in which the generator aims to fool the critic, while the critic model seeks to identify the generator's errors. Using reinforcement learning based on the game outcomes, the models iteratively improve; the winner of each confrontation receives a positive reward and the loser receives a negative reward, driving continuous self-evolution. Experiments on three reasoning process benchmarks (ProcessBench, PRM800K, DeltaBench) demonstrate that our SPC progressively enhances its error detection capabilities (e.g., accuracy increases from 70.8% to 77.7% on ProcessBench) and surpasses strong baselines, including the distilled R1 model. Furthermore, SPC can guide the test-time search of diverse LLMs and significantly improve their mathematical reasoning performance on MATH500 and AIME2024, surpassing those guided by state-of-the-art process reward models.
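中文: 自博弈评论家(SPC)让"狡猾生成器"与"评论家"在对抗博弈中通过强化学习相互进化,无需人工步骤级标注即可提升推理步骤的错误检测能力(在ProcessBench上准确率从70.8%提升至77.7%),并能指导测试时搜索以增强数学推理表现。
English: Self-Play Critic (SPC) evolves a critic through adversarial self-play between a "sneaky generator" and a critic model, improving step-level error detection without manual annotation (70.8% to 77.7% accuracy on ProcessBench) and guiding test-time search to boost mathematical reasoning.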

Authors:Jikai Wang, Juntao Li, Jianye Hou, Bowen Yan, Lijun Wu, Min Zhang
Title: Efficient Reasoning for LLMs through Speculative Chain-of-Thought
Abstract:
Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have recently attracted widespread attention due to their impressive task-solving abilities. However, the enormous model size and the generation of lengthy thought chains introduce significant reasoning costs and response latency. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. In this paper, we introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective: accelerating average reasoning speed through collaboration between large and small models. SCoT conducts thought-level drafting using a lightweight draft model, then selects the best CoT draft and corrects the error cases with the target model. The proposed thinking-behavior alignment improves drafting efficiency, and the draft selection strategy maintains the prediction accuracy of the target model for complex tasks. Experimental results on the GSM8K, MATH, GaoKao, CollegeMath and Olympiad datasets show that SCoT reduces reasoning latency by 48\%$\sim$66\% and 21\%$\sim$49\% for Deepseek-R1-Distill-Qwen-32B and Deepseek-R1-Distill-Llama-70B while achieving near-target-model-level performance. Our code is available at https://github.com/Jikai0Wang/Speculative_CoT.
中文: 本文提出的推测性思维链(SCoT)方法通过大小模型协作加速平均推理速度,在保持接近目标模型性能的同时,将推理延迟降低了21%至66%。
English: The paper introduces Speculative Chain-of-Thought (SCoT), a method that reduces reasoning latency by accelerating average reasoning speed through collaboration between large and small models, achieving near-target-model performance while cutting latency by 21% to 66% across various benchmarks.
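The draft-select-correct control flow can be sketched as follows; `draft`, `select`, and `solve` are hypothetical callables standing in for the small draft model and the target model.

```python
# Speculative CoT: accept a cheap draft when possible, fall back otherwise.
from typing import Callable, List, Optional

def speculative_cot(question: str,
                    draft: Callable[[str], List[str]],                  # small model: N drafts
                    select: Callable[[str, List[str]], Optional[int]],  # target picks a draft
                    solve: Callable[[str], str]) -> str:                # target's own CoT
    drafts = draft(question)
    choice = select(question, drafts)   # index of best draft, or None
    if choice is not None:
        return drafts[choice]           # cheap path: accept the draft
    return solve(question)              # fallback: full target-model reasoning

# Toy usage with stand-ins:
out = speculative_cot(
    "2+2?",
    draft=lambda q: ["2+2=4", "2+2=5"],
    select=lambda q, ds: 0 if "4" in ds[0] else None,
    solve=lambda q: "long target-model chain ...",
)
print(out)
```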

Authors:Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao
Title: Versatile Framework for Song Generation with Prompt-based Control
Abstract:
Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment. Additionally, they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises these primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method for generating singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control. This model allows for generating controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, contribute to the comprehensive multi-task song generation system, allowing for extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks using objective and subjective metrics. Demos and codes are available at https://aaronz345.github.io/VersBandDemo and https://github.com/AaronZ345/VersBand.
Chinese: VersBand 是一个多任务歌曲生成框架,通过专门模型实现基于提示的高质量对齐歌曲合成,在人声、伴奏、歌词和旋律方面均优于基线模型。
English: VersBand is a multi-task song generation framework that synthesizes high-quality, aligned songs with prompt-based control, outperforming baselines across various tasks through its specialized models for vocals, accompaniments, lyrics, and melodies.

Authors:Mohammad Mahdi Abootorabi, Omid Ghahroodi, Pardis Sadat Zahraei, Hossein Behzadasl, Alireza Mirrokni, Mobina Salimipanah, Arash Rasouli, Bahar Behzadipour, Sara Azarnoush, Benyamin Maleki, Erfan Sadraiye, Kiarash Kiani Feriz, Mahdi Teymouri Nahad, Ali Moghadasi, Abolfazl Eshagh Abianeh, Nizi Nazar, Hamid R. Rabiee, Mahdieh Soleymani Baghshah, Meisam Ahmadi, Ehsaneddin Asgari
Title: Generative AI for Character Animation: A Comprehensive Survey of Techniques, Applications, and Future Directions
Abstract:
Generative AI is reshaping art, gaming, and most notably animation. Recent breakthroughs in foundation and diffusion models have reduced the time and cost of producing animated content. Characters are central animation components, involving motion, emotions, gestures, and facial expressions. The pace and breadth of advances in recent months make it difficult to maintain a coherent view of the field, motivating the need for an integrative review. Unlike earlier overviews that treat avatars, gestures, or facial animation in isolation, this survey offers a single, comprehensive perspective on all the main generative AI applications for character animation. We begin by examining the state-of-the-art in facial animation, expression rendering, image synthesis, avatar creation, gesture modeling, motion synthesis, object generation, and texture synthesis. We highlight leading research, practical deployments, commonly used datasets, and emerging trends for each area. To support newcomers, we also provide a comprehensive background section that introduces foundational models and evaluation metrics, equipping readers with the knowledge needed to enter the field. We discuss open challenges and map future research directions, providing a roadmap to advance AI-driven character-animation technologies. This survey is intended as a resource for researchers and developers entering the field of generative AI animation or adjacent fields. Resources are available at: https://github.com/llm-lab-org/Generative-AI-for-Character-Animation-Survey.
Chinese: 生成式AI通过整合面部、手势和动作合成等领域的进展,正在革新角色动画技术,本综述为研究人员提供了全面指导,涵盖当前技术与未来发展方向。
English: Generative AI is revolutionizing character animation by integrating advancements in facial, gesture, and motion synthesis, offering a comprehensive review to guide researchers through current technologies and future directions.

Authors:Di Wu, Yibin Lei, Christof Monz
Title: Calibrating Translation Decoding with Quality Estimation on LLMs
Abstract:
Neural machine translation (NMT) systems typically employ maximum a posteriori (MAP) decoding to select the highest-scoring translation from the distribution mass. However, recent evidence highlights the inadequacy of MAP decoding, often resulting in low-quality or even pathological hypotheses -- the decoding objective is not aligned with real-world translation quality. This paper proposes calibrating hypothesis likelihoods with translation quality from a distribution view by directly optimizing their Pearson correlation -- thereby enhancing the effectiveness of translation decoding. With our method, translation on large language models (LLMs) improves substantially after limited training (2K instances per direction). This improvement is orthogonal to those achieved through supervised fine-tuning, leading to substantial gains across a broad range of metrics and human evaluations -- even when applied to top-performing translation-specialized LLMs fine-tuned on high-quality translation data, such as Tower, or when compared to recent preference optimization methods, like CPO. Moreover, the calibrated translation likelihood can directly serve as a strong proxy for translation quality, closely approximating or even surpassing some state-of-the-art translation quality estimation models, like CometKiwi. Lastly, our in-depth analysis demonstrates that calibration enhances the effectiveness of MAP decoding, thereby enabling greater efficiency in real-world deployment. The resulting state-of-the-art translation model, which covers 10 languages, along with the accompanying code and human evaluation data, has been released to the community: https://github.com/moore3930/calibrating-llm-mt.
中文: 本文提出一种通过优化假设似然与翻译质量相关性进行校准的方法,仅需少量训练即可大幅提升大语言模型的翻译性能,同时提高解码效率并可直接作为翻译质量评估指标。
English: This paper introduces a calibration method that optimizes the correlation between hypothesis likelihoods and translation quality, significantly improving neural machine translation performance in large language models with minimal training and enhancing decoding efficiency.
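The calibration objective is essentially the negative Pearson correlation between hypothesis log-likelihoods and quality scores over a batch of candidate translations. The numpy sketch below shows the math only; in training it would be computed on model log-probs with autograd.

```python
# Negative Pearson correlation between log-likelihoods and quality scores.
import numpy as np

def neg_pearson(loglikelihoods: np.ndarray, quality: np.ndarray) -> float:
    ll = loglikelihoods - loglikelihoods.mean()
    q = quality - quality.mean()
    corr = (ll * q).sum() / (np.sqrt((ll**2).sum()) * np.sqrt((q**2).sum()) + 1e-8)
    return -corr  # minimizing this pushes likelihood to track quality

lls = np.array([-12.0, -9.5, -15.2, -8.1])  # per-hypothesis log-likelihoods
qe  = np.array([0.62, 0.74, 0.40, 0.81])    # quality-estimation scores
print(neg_pearson(lls, qe))                 # close to -1 when well calibrated
```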

Authors:Mohammad Akbar-Tajari, Mohammad Taher Pilehvar, Mohammad Mahmoody
Title: Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs
Abstract:
The challenge of ensuring Large Language Models (LLMs) align with societal standards is of increasing interest, as these models are still prone to adversarial jailbreaks that bypass their safety mechanisms. Identifying these vulnerabilities is crucial for enhancing the robustness of LLMs against such exploits. We propose Graph of ATtacks (GoAT), a method for generating adversarial prompts to test the robustness of LLM alignment using the Graph of Thoughts framework [Besta et al., 2024]. GoAT excels at generating highly effective jailbreak prompts with fewer queries to the victim model than state-of-the-art attacks, achieving up to five times better jailbreak success rate against robust models like Llama. Notably, GoAT creates high-quality, human-readable prompts without requiring access to the targeted model's parameters, making it a black-box attack. Unlike approaches constrained by tree-based reasoning, GoAT's reasoning is based on a more intricate graph structure. By making simultaneous attack paths aware of each other's progress, this dynamic framework allows a deeper integration and refinement of reasoning paths, significantly enhancing the collaborative exploration of adversarial vulnerabilities in LLMs. At a technical level, GoAT starts with a graph structure and iteratively refines it by combining and improving thoughts, enabling synergy between different thought paths. The code for our implementation can be found at: https://github.com/GoAT-pydev/Graph_of_Attacks.
中文: GoAT是一种基于图推理框架的黑盒方法,能高效生成可读性强且有效的对抗性提示来测试大语言模型的对齐性,相比现有最优攻击方法,它以更少查询次数实现了显著更高的越狱成功率。
English: GoAT is a black-box method that uses a graph-based reasoning framework to efficiently generate effective, human-readable adversarial prompts for testing LLM alignment, achieving significantly higher jailbreak success rates with fewer queries than current state-of-the-art attacks.

Authors:Debarati Das, Khanh Chi Le, Ritik Sachin Parkar, Karin De Langis, Brendan Madson, Chad M. Berryman, Robin M. Willis, Daniel H. Moses, Brett McDonnell, Daniel Schwarcz, Dongyeop Kang
Title: LawFlow: Collecting and Simulating Lawyers' Thought Processes on Business Formation Case Studies
Abstract:
Legal practitioners, particularly those early in their careers, face complex, high-stakes tasks that require adaptive, context-sensitive reasoning. While AI holds promise in supporting legal work, current datasets and models are narrowly focused on isolated subtasks and fail to capture the end-to-end decision-making required in real-world practice. To address this gap, we introduce LawFlow, a dataset of complete end-to-end legal workflows collected from trained law students, grounded in real-world business entity formation scenarios. Unlike prior datasets focused on input-output pairs or linear chains of thought, LawFlow captures dynamic, modular, and iterative reasoning processes that reflect the ambiguity, revision, and client-adaptive strategies of legal practice. Using LawFlow, we compare human and LLM-generated workflows, revealing systematic differences in structure, reasoning flexibility, and plan execution. Human workflows tend to be modular and adaptive, while LLM workflows are more sequential, exhaustive, and less sensitive to downstream implications. Our findings also suggest that legal professionals prefer AI to carry out supportive roles, such as brainstorming, identifying blind spots, and surfacing alternatives, rather than executing complex workflows end-to-end. Our results highlight both the current limitations of LLMs in supporting complex legal workflows and opportunities for developing more collaborative, reasoning-aware legal AI systems. All data and code are available on our project page (https://minnesotanlp.github.io/LawFlow-website/).
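中文: LawFlow收集了法学学生在真实商业实体设立场景中的端到端法律工作流,对比发现人类推理更模块化、更具适应性,而大语言模型的工作流更线性;法律从业者更希望AI承担头脑风暴等辅助角色,而非端到端执行复杂流程。
English: LawFlow is a dataset of end-to-end legal workflows collected from trained law students on business formation scenarios; comparisons show that human reasoning is more modular and adaptive than LLMs' sequential workflows, and that legal professionals prefer AI in supportive roles rather than end-to-end execution.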

Authors:Hayley Ross, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara
Title: When2Call: When (not) to Call Tools
Abstract:
Leveraging external tools is a key feature for modern Language Models (LMs) to expand their capabilities and integrate them into existing systems. However, existing benchmarks primarily focus on the accuracy of tool calling -- whether the correct tool is called with the correct parameters -- and less on evaluating when LMs should (not) call tools. We develop a new benchmark, When2Call, which evaluates tool-calling decision-making: when to generate a tool call, when to ask follow-up questions and when to admit the question can't be answered with the tools provided. We find that state-of-the-art tool-calling LMs show significant room for improvement on When2Call, indicating the importance of this benchmark. We also develop a training set for When2Call and leverage the multiple-choice nature of the benchmark to develop a preference optimization training regime, which shows considerably more improvement than traditional fine-tuning. We release the benchmark and training data as well as evaluation scripts at https://github.com/NVIDIA/When2Call.
Chinese: When2Call基准测试评估语言模型在何时使用工具方面的决策能力,揭示了当前模型的显著不足,并提出了优于传统微调的训练方法。
English: The When2Call benchmark evaluates language models' decision-making on when to use tools, revealing significant gaps in current models and introducing a training method that outperforms traditional fine-tuning.

Authors:Jong Inn Park, Maanas Taneja, Qianwen Wang, Dongyeop Kang
Title: Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation
Abstract:
Generating engaging, accurate short-form videos from scientific papers is challenging due to content complexity and the gap between expert authors and readers. Existing end-to-end methods often suffer from factual inaccuracies and visual artifacts, limiting their utility for scientific dissemination. To address these issues, we propose SciTalk, a novel multi-LLM agentic framework that grounds videos in various sources, such as text, figures, visual styles, and avatars. Inspired by content creators' workflows, SciTalk uses specialized agents for content summarization, visual scene planning, and text and layout editing, and incorporates an iterative feedback mechanism in which video agents simulate user roles, give feedback on videos generated in previous iterations, and refine the generation prompts. Experimental evaluations show that SciTalk outperforms simple prompting methods in generating scientifically accurate and engaging content over successive iterations of video generation. Although preliminary results do not yet match the quality of human creators, our framework provides valuable insights into the challenges and benefits of feedback-driven video generation. Our code, data, and generated videos will be publicly available.
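中文: SciTalk是一个受内容创作者工作流启发的多智能体框架,通过内容摘要、视觉场景规划、文本版式编辑等专门智能体和迭代反馈机制,从科学论文生成更准确、更具吸引力的短视频。
English: SciTalk is a creator-inspired multi-LLM agentic framework that uses specialized agents for summarization, scene planning, and layout editing, together with an iterative feedback loop, to generate more accurate and engaging short-form videos from scientific papers.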

Authors:Jianyou Wang, Weili Cao, Kaicheng Wang, Xiaoyue Wang, Ashish Dalvi, Gino Prasad, Qishan Liang, Hsuan-lin Her, Ming Wang, Qin Yang, Gene W. Yeo, David E. Neal, Maxim Khan, Christopher D. Rosin, Ramamohan Paturi, Leon Bergen
Title: EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers
Abstract:
We study the task of automatically finding evidence relevant to hypotheses in biomedical papers. Finding relevant evidence is an important step when researchers investigate scientific hypotheses. We introduce EvidenceBench to measure models' performance on this task. The benchmark is created by a novel pipeline that consists of hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence, completely guided by and faithfully following existing human experts' judgment. We demonstrate the pipeline's validity and accuracy with multiple sets of human-expert annotations. We evaluated a diverse set of language models and retrieval systems on the benchmark and found that model performance still falls significantly short of the expert level on this task. To show the scalability of our proposed pipeline, we create the larger EvidenceBench-100k, with 107,461 fully annotated papers with hypotheses, to facilitate model training and development. Both datasets are available at https://github.com/EvidenceBench/EvidenceBench
中文: 本研究提出了EvidenceBench基准,通过基于专家判断生成假设并标注论文的流程,评估模型在识别生物医学假设相关证据方面的表现,发现现有模型性能仍远低于专家水平。
English: This research introduces EvidenceBench, a benchmark for evaluating how well models identify evidence relevant to biomedical hypotheses, created through a pipeline that generates hypotheses and annotates papers based on expert judgments, and finds current models still lag significantly behind human expert performance.

Authors:KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yuefeng Wu, Yuxin Wu, Dongchao Yang, Hao Yang, Ying Yang, Zhilin Yang, Aoxiong Yin, Ruibin Yuan, Yutong Zhang, Zaida Zhou
Title: Kimi-Audio Technical Report
Abstract:
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse set of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the codes, model checkpoints, as well as the evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
Chinese: Kimi-Audio是一款开源的音频基础模型,凭借创新的架构和海量数据训练,在音频理解、生成与对话方面表现卓越,并在多项基准测试中达到领先水平。
English: Kimi-Audio is an open-source audio foundation model that excels in understanding, generating, and conversing with audio, achieving state-of-the-art performance across various benchmarks through innovative architecture and extensive data training.

Authors:Lei Shen, Xiaoyu Shen
Title: Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant
Abstract:
In recent years, multi-agent frameworks powered by large language models (LLMs) have advanced rapidly. Despite this progress, there is still a notable absence of benchmark datasets specifically tailored to evaluate their performance. To bridge this gap, we introduce Auto-SLURP, a benchmark dataset aimed at evaluating LLM-based multi-agent frameworks in the context of intelligent personal assistants. Auto-SLURP extends the original SLURP dataset -- initially developed for natural language understanding tasks -- by relabeling the data and integrating simulated servers and external services. This enhancement enables a comprehensive end-to-end evaluation pipeline, covering language understanding, task execution, and response generation. Our experiments demonstrate that Auto-SLURP presents a significant challenge for current state-of-the-art frameworks, highlighting that truly reliable and intelligent multi-agent personal assistants remain a work in progress. The dataset and related code are available at https://github.com/lorashen/Auto-SLURP/.
Chinese: Auto-SLURP是一个通过重新标注SLURP数据集并集成模拟服务来评估基于大语言模型的多智能体个人助手的基准数据集,实验表明现有先进框架仍难以实现真正可靠的智能表现。
English: Auto-SLURP is a new benchmark dataset designed to evaluate LLM-based multi-agent frameworks for intelligent personal assistants by extending the SLURP dataset with relabeled data and simulated services, revealing current systems' limitations in achieving reliable performance.

Authors:Ritesh Goru, Shanay Mehta, Prateek Jain
Title: One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning
Abstract:
Fine-tuning Large Language Models (LLMs) on multi-turn reasoning datasets requires N (number of turns) separate forward passes per conversation due to reasoning-token visibility constraints, as reasoning tokens for a turn are discarded in subsequent turns. We propose duplicating response tokens along with a custom attention mask to enable single-pass processing of entire conversations. We prove our method produces identical losses to the N-pass approach while reducing time complexity from $O(N^{3})$ to $O(N^{2})$ and maintaining the same memory complexity for a transformer-based model. Our approach achieves significant training speedup while preserving accuracy. Our implementation is available online (https://github.com/devrev/One-Pass-to-Reason).
Chinese: 该方法通过复制响应令牌和自定义注意力掩码,实现了多轮对话的单次处理,在保持精度的同时将时间复杂度从O(N³)降低至O(N²),且损失与N次处理方法完全相同。
English: The proposed method duplicates response tokens with a custom attention mask to enable single-pass processing of multi-turn conversations, achieving identical losses to the N-pass approach while reducing time complexity from O(N³) to O(N²) and preserving accuracy.
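A minimal sketch of the masking idea as the abstract describes it: duplicate each turn's response so that one copy is trained with its own reasoning visible, while a "clean" copy is what later turns attend to. The segment layout and visibility rules here are illustrative assumptions distilled from the abstract, not the authors' exact mask.

```python
# Block-sparse visibility over duplicated response segments (toy, 2 turns).
import numpy as np

# Per turn: user prompt, reasoning, response copy kept for the loss (sees the
# reasoning), and a "clean" response copy that later turns attend to instead.
segments = [("user", 0), ("think", 0), ("resp_loss", 0), ("resp_clean", 0),
            ("user", 1), ("think", 1), ("resp_loss", 1), ("resp_clean", 1)]

def visible(qi: int, ki: int) -> bool:
    (qk, qt), (kk, kt) = segments[qi], segments[ki]
    if ki > qi:
        return False                          # causal order overall
    if kt < qt:
        return kk in ("user", "resp_clean")   # reasoning hidden from later turns
    # same turn: the clean copy never sees reasoning or the loss copy
    return not (qk == "resp_clean" and kk in ("think", "resp_loss"))

n = len(segments)
mask = np.array([[visible(i, j) for j in range(n)] for i in range(n)], dtype=int)
print(mask)  # block-sparse: each turn's reasoning never leaks forward
```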

Authors:Jingjin Wang
Title: PropRAG: Guiding Retrieval with Beam Search over Proposition Paths
Abstract:
Retrieval Augmented Generation (RAG) has become the standard non-parametric approach for equipping Large Language Models (LLMs) with up-to-date knowledge and mitigating catastrophic forgetting common in continual learning. However, standard RAG, relying on independent passage retrieval, fails to capture the interconnected nature of human memory crucial for complex reasoning (associativity) and contextual understanding (sense-making). While structured RAG methods like HippoRAG utilize knowledge graphs (KGs) built from triples, the inherent context loss limits fidelity. We introduce PropRAG, a framework leveraging contextually rich propositions and a novel beam search algorithm over proposition paths to explicitly discover multi-step reasoning chains. Crucially, PropRAG's online retrieval process operates entirely without invoking generative LLMs, relying instead on efficient graph traversal and pre-computed embeddings. This avoids online LLM inference costs and potential inconsistencies during evidence gathering. LLMs are used effectively offline for high-quality proposition extraction and post-retrieval for answer generation. PropRAG achieves state-of-the-art zero-shot Recall@5 results on PopQA (55.3%), 2Wiki (93.7%), HotpotQA (97.0%), and MuSiQue (77.3%), alongside top F1 scores (e.g., 52.4% on MuSiQue). By improving evidence retrieval through richer representation and explicit, LLM-free online path finding, PropRAG advances non-parametric continual learning.
Chinese: PropRAG 提出了一种利用情境化命题和基于命题路径的束搜索框架,无需在线调用大语言模型即可实现显式多步推理,通过更丰富的表征改进证据检索,在多个基准测试中取得了最先进的性能。
English: PropRAG introduces a framework using contextual propositions and beam search over proposition paths to enable explicit multi-step reasoning without online LLM inference, achieving state-of-the-art results on multiple benchmarks by enhancing evidence retrieval through richer representations.
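A minimal sketch of beam search over proposition paths under stated assumptions: expand the top-B partial paths through a proposition graph and score each path against the query. The graph and scorer below are toys; the actual system scores with pre-computed embeddings and makes no online LLM calls.

```python
# Beam search over a proposition graph: keep the best-scoring partial paths.
from typing import Callable, Dict, List, Tuple

def beam_search_paths(start: str,
                      graph: Dict[str, List[str]],
                      score: Callable[[List[str]], float],
                      beam: int = 2, depth: int = 2) -> List[Tuple[List[str], float]]:
    paths = [([start], score([start]))]
    for _ in range(depth):
        candidates = []
        for path, _ in paths:
            for nxt in graph.get(path[-1], []):
                new_path = path + [nxt]
                candidates.append((new_path, score(new_path)))
        # keep the best `beam` expansions; retain old paths at dead ends
        paths = sorted(candidates or paths, key=lambda p: p[1], reverse=True)[:beam]
    return paths

# Toy graph of proposition IDs and a stand-in scorer (the real system scores
# paths with pre-computed embeddings against the query).
graph = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": ["p4"]}
toy_score = lambda path: -len(path) + (5 if "p4" in path else 0)
for path, s in beam_search_paths("p1", graph, toy_score):
    print(path, s)
```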

Authors:Jianyu Liu, Hangyu Guo, Ranjie Duan, Xingyuan Bu, Yancheng He, Shilong Li, Hui Huang, Jiaheng Liu, Yucheng Wang, Chenchen Jing, Xingwei Qu, Xiao Zhang, Yingshui Tan, Yanan Wu, Jihao Gu, Yangguang Li, Jianke Zhu
Title: DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
Abstract:
Multimodal Large Language Models (MLLMs) pose unique safety challenges due to their integration of visual and textual data, thereby introducing new dimensions of potential attacks and complex risk combinations. In this paper, we begin with a detailed analysis aimed at disentangling risks through step-by-step reasoning within multimodal inputs. We find that systematic multimodal risk disentanglement substantially enhances the risk awareness of MLLMs. Via leveraging the strong discriminative abilities of multimodal risk disentanglement, we further introduce \textbf{DREAM} (\textit{\textbf{D}isentangling \textbf{R}isks to \textbf{E}nhance Safety \textbf{A}lignment in \textbf{M}LLMs}), a novel approach that enhances safety alignment in MLLMs through supervised fine-tuning and iterative Reinforcement Learning from AI Feedback (RLAIF). Experimental results show that DREAM significantly boosts safety during both inference and training phases without compromising performance on normal tasks (i.e., without inducing oversafety), achieving a 16.17\% improvement in the SIUO safe\&effective score compared to GPT-4V. The data and code are available at https://github.com/Kizna1ver/DREAM.
中文: DREAM方法通过监督微调和强化学习系统性地解构多模态大语言模型中的风险,在保持正常任务性能的同时将安全有效性评分比GPT-4V提升了16.17%。
English: The DREAM method enhances safety in Multimodal Large Language Models by systematically disentangling risks through supervised fine-tuning and reinforcement learning, achieving a 16.17% improvement in safety scores over GPT-4V without compromising performance.

Authors:Yiwei Zha
Title: SMARTFinRAG: Interactive Modularized Financial RAG Benchmark
Abstract:
Financial sectors are rapidly adopting language model technologies, yet evaluating specialized RAG systems in this domain remains challenging. This paper introduces SMARTFinRAG, addressing three critical gaps in financial RAG assessment: (1) a fully modular architecture where components can be dynamically interchanged during runtime; (2) a document-centric evaluation paradigm generating domain-specific QA pairs from newly ingested financial documents; and (3) an intuitive interface bridging research-implementation divides. Our evaluation quantifies both retrieval efficacy and response quality, revealing significant performance variations across configurations. The platform's open-source architecture supports transparent, reproducible research while addressing practical deployment challenges faced by financial institutions implementing RAG systems.
中文摘要:本文介绍了SMARTFinRAG平台,通过动态组件交换、以文档为中心的问答生成和直观界面解决金融RAG评估的关键缺口,同时量化性能差异并支持透明化研究。
English Summary: This paper introduces SMARTFinRAG, a modular platform addressing key gaps in financial RAG evaluation through dynamic component interchange, document-centric QA generation, and an intuitive interface, while quantifying performance variations and supporting transparent research.
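
The "dynamically interchangeable components" idea can be pictured with a minimal registry-style sketch, assuming a shared retriever interface; all class and method names below are hypothetical, not the platform's actual API.

from typing import Protocol

class Retriever(Protocol):
    def retrieve(self, query: str, k: int) -> list[str]: ...

class BM25Retriever:
    def __init__(self, docs: list[str]):
        self.docs = docs
    def retrieve(self, query: str, k: int) -> list[str]:
        # toy lexical-overlap score standing in for real BM25
        scored = sorted(self.docs, key=lambda d: -len(set(query.split()) & set(d.split())))
        return scored[:k]

class DenseRetriever:
    def __init__(self, docs: list[str]):
        self.docs = docs
    def retrieve(self, query: str, k: int) -> list[str]:
        return self.docs[:k]  # placeholder for embedding search

class RAGPipeline:
    def __init__(self, retriever: Retriever):
        self.retriever = retriever
    def swap_retriever(self, retriever: Retriever):   # runtime interchange
        self.retriever = retriever
    def answer(self, query: str) -> str:
        ctx = self.retriever.retrieve(query, k=2)
        return f"answer({query}) given {ctx}"

docs = ["revenue grew 10%", "the bond matured", "interest rates rose"]
pipe = RAGPipeline(BM25Retriever(docs))
print(pipe.answer("what happened to interest rates"))
pipe.swap_retriever(DenseRetriever(docs))             # hot-swap during runtime
print(pipe.answer("what happened to interest rates"))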

Authors:Haokai Zhang, Shengtao Zhang, Zijian Cai, Heng Wang, Ruixuan Zhu, Zinan Zeng, Minnan Luo
Title: Unveiling the Hidden: Movie Genre and User Bias in Spoiler Detection
Abstract:
Spoilers are a prominent feature of movie reviews on platforms like IMDb and Rotten Tomatoes, offering both benefits and drawbacks. They can guide some viewers' choices but also affect those who prefer no plot details in advance, making effective spoiler detection essential. Existing spoiler detection methods mainly analyze review text, often overlooking the impact of movie genres and user bias, limiting their effectiveness. To address this, we analyze movie review data, finding genre-specific variations in spoiler rates and identifying that certain users are more likely to post spoilers. Based on these findings, we introduce a new spoiler detection framework called GUSD (Genre-aware and User-specific Spoiler Detection; the code is available at https://github.com/AI-explorer-123/GUSD), which incorporates genre-specific data and user behavior bias. User bias is calculated through dynamic graph modeling of review history. Additionally, the R2GFormer module combines RetGAT (Retentive Graph Attention Network) for graph information and GenreFormer for genre-specific aggregation. The GMoE (Genre-Aware Mixture of Experts) model further assigns reviews to specialized experts based on genre. Extensive testing on benchmark datasets shows that GUSD achieves state-of-the-art results. This approach advances spoiler detection by addressing genre and user-specific patterns, enhancing user experience on movie review platforms.
中文摘要:GUSD框架通过整合电影类型特征和用户行为偏差,利用动态图建模和类型感知模块,显著提升了影评中剧透检测的效果,实现了最先进的性能。
English Summary: The GUSD framework improves spoiler detection in movie reviews by incorporating genre-specific patterns and user behavior biases, achieving state-of-the-art results through dynamic graph modeling and specialized genre-aware modules.
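
A toy stand-in for the genre-aware expert assignment (hard routing by genre id, in PyTorch): this is our simplification of the GMoE idea and omits the graph components; all names are illustrative.

import torch
import torch.nn as nn

class GenreMoE(nn.Module):
    def __init__(self, dim: int, num_genres: int):
        super().__init__()
        # one small expert MLP per movie genre
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
            for _ in range(num_genres)
        )
    def forward(self, review_emb: torch.Tensor, genre_ids: torch.Tensor) -> torch.Tensor:
        # route each review to its genre's expert; returns spoiler logits
        logits = torch.stack([self.experts[g](review_emb[i])
                              for i, g in enumerate(genre_ids.tolist())])
        return logits.squeeze(-1)

moe = GenreMoE(dim=16, num_genres=3)
embs = torch.randn(4, 16)                  # 4 review embeddings
genres = torch.tensor([0, 2, 1, 0])        # genre id per review
print(moe(embs, genres).shape)             # torch.Size([4])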

Authors:Jihyun Lee, Yejin Jeon, Seungyeon Seo, Gary Geunbae Lee
Title: PicPersona-TOD : A Dataset for Personalizing Utterance Style in Task-Oriented Dialogue with Image Persona
Abstract:
Task-Oriented Dialogue (TOD) systems are designed to fulfill user requests through natural language interactions, yet existing systems often produce generic, monotonic responses that lack individuality and fail to adapt to users' personal attributes. To address this, we introduce PicPersona-TOD, a novel dataset that incorporates user images as part of the persona, enabling personalized responses tailored to user-specific factors such as age or emotional context. This is facilitated by first impressions, dialogue policy-guided prompting, and the use of external knowledge to reduce hallucinations. Human evaluations confirm that our dataset enhances user experience, with personalized responses contributing to a more engaging interaction. Additionally, we introduce a new NLG model, Pictor, which not only personalizes responses, but also demonstrates robust performance across unseen domains. Code and data: https://github.com/JihyunLee1/PicPersona.
中文: PicPersona-TOD是一种创新数据集,通过整合用户图像实现个性化对话回复,借助定制化互动提升用户参与度并减少通用性回答。
English: PicPersona-TOD is a novel dataset that integrates user images to enable personalized dialogue responses, improving user engagement through tailored interactions and reducing generic outputs.

Authors:Yongxuan Wu, Runyu Chen, Peiyu Liu, Hongjin Qian
Title: LiveLongBench: Tackling Long-Context Understanding for Spoken Texts from Live Streams
Abstract:
Long-context understanding poses significant challenges in natural language processing, particularly for real-world dialogues characterized by speech-based elements, high redundancy, and uneven information density. Although large language models (LLMs) achieve impressive results on existing benchmarks, these datasets fail to reflect the complexities of such texts, limiting their applicability to practical scenarios. To bridge this gap, we construct the first spoken long-text dataset, derived from live streams, designed to reflect the redundancy-rich and conversational nature of real-world scenarios. We construct tasks in three categories: retrieval-dependent, reasoning-dependent, and hybrid. We then evaluate both popular LLMs and specialized methods to assess their ability to understand long-contexts in these tasks. Our results show that current methods exhibit strong task-specific preferences and perform poorly on highly redundant inputs, with no single method consistently outperforming others. We propose a new baseline that better handles redundancy in spoken text and achieves strong performance across tasks. Our findings highlight key limitations of current methods and suggest future directions for improving long-context understanding. Finally, our benchmark fills a gap in evaluating long-context spoken language understanding and provides a practical foundation for developing real-world e-commerce systems. The code and benchmark are available at https://github.com/Yarayx/livelongbench.
中文: 本研究构建了首个基于直播的口语长文本数据集,以弥补现有基准在反映真实对话冗余性和复杂性方面的不足,发现当前方法在处理高冗余输入时表现不佳,并提出一种新基线在各项任务中均取得强劲性能。
English: This study introduces the first spoken long-text dataset from live streams to address the limitations of current benchmarks in capturing the redundancy and conversational complexity of real-world dialogues, revealing that existing methods struggle with highly redundant inputs and proposing a new baseline that improves performance across tasks.

Authors:Xiuying Chen, Tairan Wang, Juexiao Zhou, Zirui Song, Xin Gao, Xiangliang Zhang
Title: Evaluating and Mitigating Bias in AI-Based Medical Text Generation
Abstract:
Artificial intelligence (AI) systems, particularly those based on deep learning models, have increasingly achieved expert-level performance in medical applications. However, there is growing concern that such AI systems may reflect and amplify human bias, and reduce the quality of their performance in historically under-served populations. The fairness issue has attracted considerable research interest in the medical imaging classification field, yet it remains understudied in the text generation domain. In this study, we investigate the fairness problem in text generation within the medical field and observe significant performance discrepancies across different races, sexes, and age groups, including intersectional groups, various model scales, and different evaluation metrics. To mitigate this fairness issue, we propose an algorithm that selectively optimizes those underperformed groups to reduce bias. The selection rules take into account not only word-level accuracy but also the pathology accuracy to the target reference, while ensuring that the entire process remains fully differentiable for effective model training. Our evaluations across multiple backbones, datasets, and modalities demonstrate that our proposed algorithm enhances fairness in text generation without compromising overall performance. Specifically, the disparities among various groups across different metrics were diminished by more than 30% with our algorithm, while the relative change in text generation accuracy was typically within 2%. By reducing the bias generated by deep learning models, our proposed approach can potentially alleviate concerns about the fairness and reliability of text generation diagnosis in medical domain. Our code is publicly available to facilitate further research at https://github.com/iriscxy/GenFair.
中文: 医疗文本生成中的人工智能系统在不同人口群体间存在性能差异,而提出的选择性优化算法在保持整体准确率波动不超过2%的同时,将各类指标下的群体差异降低了30%以上。
English: AI systems in medical text generation exhibit performance disparities across demographic groups, but a proposed selective optimization algorithm effectively reduces bias by over 30% while maintaining overall accuracy within 2% variation.
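
The selective optimization can be sketched as a differentiable group-reweighted loss: compute per-group losses, then upweight groups that currently underperform. This is a minimal sketch of the general idea, assuming a softmax weighting rule; the released GenFair code may differ.

import torch

def group_reweighted_loss(per_sample_loss: torch.Tensor,
                          group_ids: torch.Tensor,
                          num_groups: int,
                          temperature: float = 1.0) -> torch.Tensor:
    """Upweight underperforming demographic groups while keeping the whole
    objective differentiable with respect to the per-sample losses."""
    group_losses = []
    for g in range(num_groups):
        mask = (group_ids == g)
        group_losses.append(per_sample_loss[mask].mean() if mask.any()
                            else per_sample_loss.new_zeros(()))
    group_losses = torch.stack(group_losses)
    # softmax over (detached) group losses: worse groups get larger weights
    weights = torch.softmax(group_losses.detach() / temperature, dim=0)
    return (weights * group_losses).sum()

loss = torch.rand(8, requires_grad=True)          # stand-in per-sample losses
groups = torch.tensor([0, 0, 1, 1, 1, 2, 2, 2])   # demographic group per sample
total = group_reweighted_loss(loss, groups, num_groups=3)
total.backward()                                  # gradients flow to all samples
print(total.item())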

Authors:Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang
Title: Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Abstract:
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, particularly from the authors of those papers, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
中文摘要:PaperCoder是一个多智能体大语言模型框架,通过规划、分析和生成三个阶段将机器学习论文自动转化为功能性代码库,在人工评估和基准测试中均展现出卓越性能。
English Summary: PaperCoder is a multi-agent LLM framework that automatically converts machine learning papers into functional code repositories through planning, analysis, and generation phases, demonstrating superior performance in both human and benchmark evaluations.

Authors:Chanhee Park, Hyeonseok Moon, Chanjun Park, Heuiseok Lim
Title: MIRAGE: A Metric-Intensive Benchmark for Retrieval-Augmented Generation Evaluation
Abstract:
Retrieval-Augmented Generation (RAG) has gained prominence as an effective method for enhancing the generative capabilities of Large Language Models (LLMs) through the incorporation of external knowledge. However, the evaluation of RAG systems remains a challenge, due to the intricate interplay between retrieval and generation components. This limitation has resulted in a scarcity of benchmarks that facilitate a detailed, component-specific assessment. In this work, we present MIRAGE, a Question Answering dataset specifically designed for RAG evaluation. MIRAGE consists of 7,560 curated instances mapped to a retrieval pool of 37,800 entries, enabling an efficient and precise evaluation of both retrieval and generation tasks. We also introduce novel evaluation metrics aimed at measuring RAG adaptability, encompassing dimensions such as noise vulnerability, context acceptability, context insensitivity, and context misinterpretation. Through comprehensive experiments across various retriever-LLM configurations, we provide new insights into the optimal alignment of model pairs and the nuanced dynamics within RAG systems. The dataset and evaluation code are publicly available, allowing for seamless integration and customization in diverse research settings. The MIRAGE code and data are available at https://github.com/nlpai-lab/MIRAGE.
中文: MIRAGE 是一个专为检索增强生成系统评估设计的问答数据集,包含精心筛选的实例和新型评估指标,能够全面衡量不同检索器与大语言模型组合的适应性与动态特性。
English: MIRAGE is a specialized dataset for evaluating Retrieval-Augmented Generation systems, featuring curated question-answer pairs and novel metrics to assess adaptability across different retriever-LLM configurations.
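
One plausible way to operationalize the four adaptability dimensions is as a contingency over answers produced with and without retrieved context; the mapping below is our own reading, not MIRAGE's official definition.

def adaptability_label(correct_without_ctx: bool,
                       correct_with_ctx: bool,
                       context_is_relevant: bool) -> str:
    """One plausible mapping of (no-context, with-context, context quality)
    outcomes onto MIRAGE-style categories -- an assumption, not the paper's spec."""
    if not context_is_relevant and correct_without_ctx and not correct_with_ctx:
        return "noise vulnerability"        # noisy context flipped a correct answer
    if context_is_relevant and not correct_without_ctx and correct_with_ctx:
        return "context acceptability"      # good context fixed a wrong answer
    if context_is_relevant and not correct_without_ctx and not correct_with_ctx:
        return "context insensitivity"      # good context was ignored
    if context_is_relevant and correct_without_ctx and not correct_with_ctx:
        return "context misinterpretation"  # good context corrupted a correct answer
    return "neutral"

print(adaptability_label(True, False, False))   # noise vulnerability
print(adaptability_label(False, True, True))    # context acceptability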

Authors:Hannah Cyberey, David Evans
Title: Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Abstract:
Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works, we use representation engineering techniques to study open-weight safety-tuned models. We present a method for finding a refusal-compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying negative multiples of this vector. Our code is publicly available at: https://github.com/hannahxchen/llm-censorship-steering
中文: 本研究运用表征工程技术识别并操控安全调优语言模型中的审查向量,从而控制模型的拒绝行为,并揭示可通过逆向应用向量消除审查的思维抑制机制。
English: This study uses representation engineering to identify and manipulate censorship vectors in safety-tuned language models, enabling control over refusal behaviors and revealing thought suppression mechanisms that can be reversed to eliminate censorship.
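
The vector-finding step follows standard difference-in-means representation engineering, which can be sketched with toy activations: average hidden states on refusal cases, subtract the average on compliance cases, then steer by adding scaled multiples of the resulting direction (negative multiples reduce refusal). This mirrors the general technique rather than the paper's exact procedure.

import numpy as np

rng = np.random.default_rng(0)
d = 64
# toy hidden states at one layer: rows are prompts
h_refusal = rng.normal(0.5, 1.0, size=(32, d))    # states on refused prompts
h_comply = rng.normal(-0.5, 1.0, size=(32, d))    # states on complied prompts

# difference-in-means direction, normalized
v = h_refusal.mean(axis=0) - h_comply.mean(axis=0)
v /= np.linalg.norm(v)

def steer(hidden: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along the refusal-compliance direction.
    alpha < 0 reduces refusal (the negative-multiple trick)."""
    return hidden + alpha * v

h = rng.normal(size=d)
print(float(h @ v), float(steer(h, alpha=-4.0) @ v))  # projection drops after steering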

Authors:Emre Can Acikgoz, Cheng Qian, Hongru Wang, Vardhan Dongre, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür, Gokhan Tur
Title: A Desideratum for Conversational Agents: Capabilities, Challenges, and Future Directions
Abstract:
Recent advances in Large Language Models (LLMs) have propelled conversational AI from traditional dialogue systems into sophisticated agents capable of autonomous actions, contextual awareness, and multi-turn interactions with users. Yet, fundamental questions about their capabilities, limitations, and paths forward remain open. This survey paper presents a desideratum for next-generation Conversational Agents - what has been achieved, what challenges persist, and what must be done for more scalable systems that approach human-level intelligence. To that end, we systematically analyze LLM-driven Conversational Agents by organizing their capabilities into three primary dimensions: (i) Reasoning - logical, systematic thinking inspired by human intelligence for decision making, (ii) Monitor - encompassing self-awareness and user interaction monitoring, and (iii) Control - focusing on tool utilization and policy following. Building upon this, we introduce a novel taxonomy by classifying recent work on Conversational Agents around our proposed desideratum. We identify critical research gaps and outline key directions, including realistic evaluations, long-term multi-turn reasoning skills, self-evolution capabilities, collaborative and multi-agent task completion, personalization, and proactivity. This work aims to provide a structured foundation, highlight existing limitations, and offer insights into potential future research directions for Conversational Agents, ultimately advancing progress toward Artificial General Intelligence (AGI). We maintain a curated repository of papers at: https://github.com/emrecanacikgoz/awesome-conversational-agents.
中文: 本综述将基于大语言模型的对话智能体划分为推理、监控与控制三大维度,通过构建新分类法指出研究空白与未来方向,以推动实现类人智能与通用人工智能的进展。
English: This survey organizes LLM-driven conversational agents into reasoning, monitoring, and control dimensions, proposing a taxonomy to address research gaps and future directions toward achieving human-like intelligence and AGI.

Authors:Hanwen Du, Bo Peng, Xia Ning
Title: Planning with Diffusion Models for Target-Oriented Dialogue Systems
Abstract:
Target-Oriented Dialogue (TOD) remains a significant challenge in the LLM era, where strategic dialogue planning is crucial for directing conversations toward specific targets. However, existing dialogue planning methods generate dialogue plans in a step-by-step sequential manner, and may suffer from compounding errors and myopic actions. To address these limitations, we introduce a novel dialogue planning framework, DiffTOD, which leverages diffusion models to enable non-sequential dialogue planning. DiffTOD formulates dialogue planning as a trajectory generation problem with conditional guidance, and leverages a diffusion language model to estimate the likelihood of the dialogue trajectory. To optimize the dialogue action strategies, DiffTOD introduces three tailored guidance mechanisms for different target types, offering flexible guidance toward diverse TOD targets at test time. Extensive experiments across three diverse TOD settings show that DiffTOD can effectively perform non-myopic lookahead exploration and optimize action strategies over a long horizon through non-sequential dialogue planning, and demonstrates strong flexibility across complex and diverse dialogue scenarios. Our code and data are accessible through https://github.com/ninglab/DiffTOD.
中文:DiffTOD提出了一种创新的对话规划框架,利用扩散模型实现非顺序规划,通过针对不同目标类型的定制引导机制,有效解决了目标导向对话中的复合错误和短视行为,并在多样场景中展现出强大灵活性。
English: DiffTOD introduces a novel dialogue planning framework using diffusion models to enable non-sequential planning, addressing compounding errors and myopic actions in Target-Oriented Dialogue by optimizing strategies through tailored guidance mechanisms across diverse scenarios.

Authors:Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang
Title: Process Reward Models That Think
Abstract:
Step-by-step verifiers -- also known as process reward models (PRMs) -- are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models, and outperforms LLM-as-a-Judge and discriminative verifiers -- using only 1% of the process labels in PRM800K -- across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively compared to LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
中文: ThinkPRM是一种生成式长思维链验证器,仅需1%的过程标注即可在多个基准测试中超越现有方法,以极少的监督实现高效验证计算扩展。
English: ThinkPRM is a generative, long chain-of-thought verifier that achieves superior performance across multiple benchmarks using only 1% of process labels, effectively scaling test-time verification with minimal supervision.
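
Once a verbalized verifier emits per-step correctness judgments, best-of-N selection reduces to aggregating those scores. A minimal sketch with made-up probabilities standing in for the verifier's per-step p("correct") estimates; the log-sum aggregation is our assumption.

import math

def solution_score(step_correct_probs: list[float]) -> float:
    """Aggregate per-step 'correct' probabilities from a verification CoT
    into one solution-level score (sum of logs == product of probs)."""
    return sum(math.log(max(p, 1e-9)) for p in step_correct_probs)

def best_of_n(candidates: dict[str, list[float]]) -> str:
    return max(candidates, key=lambda c: solution_score(candidates[c]))

# made-up verifier outputs for three sampled solutions
candidates = {
    "solution_A": [0.95, 0.90, 0.40],   # one shaky step drags the product down
    "solution_B": [0.85, 0.88, 0.90],
    "solution_C": [0.99, 0.30, 0.95],
}
print(best_of_n(candidates))            # solution_B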

Authors:Aniketh Garikaparthi, Manasi Patwardhan, Lovekesh Vig, Arman Cohan
Title: IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery
Abstract:
The rapid advancement in capabilities of large language models (LLMs) raises a pivotal question: How can LLMs accelerate scientific discovery? This work tackles the crucial first stage of research, generating novel hypotheses. While recent work on automated hypothesis generation focuses on multi-agent frameworks and extending test-time compute, none of the approaches effectively incorporate transparency and steerability through a synergistic Human-in-the-loop (HITL) approach. To address this gap, we introduce IRIS: Interactive Research Ideation System, an open-source platform designed for researchers to leverage LLM-assisted scientific ideation. IRIS incorporates innovative features to enhance ideation, including adaptive test-time compute expansion via Monte Carlo Tree Search (MCTS), a fine-grained feedback mechanism, and query-based literature synthesis, all designed to empower researchers with greater control and insight throughout the ideation process. We additionally conduct a user study with researchers across diverse disciplines, validating the effectiveness of our system in enhancing ideation. We open-source our code at https://github.com/Anikethh/IRIS-Interactive-Research-Ideation-System
中文: 本文提出开源平台IRIS,通过自适应计算和文献整合等功能将人类反馈与大型语言模型相结合,有效提升科学假说生成能力,并经过跨学科用户研究验证。
English: This paper introduces IRIS, an open-source platform that enhances scientific hypothesis generation by integrating human feedback with LLMs through features like adaptive computation and literature synthesis, validated by a cross-disciplinary user study.

Authors:Hanlei Zhang, Zhuohang Li, Yeshuang Zhu, Hua Xu, Peiwu Wang, Haige Zhu, Jie Zhou, Jinchao Zhang
Title: Can Large Language Models Help Multimodal Language Analysis? MMLA: A Comprehensive Benchmark
Abstract:
Multimodal language analysis is a rapidly evolving field that leverages multiple modalities to enhance the understanding of high-level semantics underlying human conversational utterances. Despite its significance, little research has investigated the capability of multimodal large language models (MLLMs) to comprehend cognitive-level semantics. In this paper, we introduce MMLA, a comprehensive benchmark specifically designed to address this gap. MMLA comprises over 61K multimodal utterances drawn from both staged and real-world scenarios, covering six core dimensions of multimodal semantics: intent, emotion, dialogue act, sentiment, speaking style, and communication behavior. We evaluate eight mainstream branches of LLMs and MLLMs using three methods: zero-shot inference, supervised fine-tuning, and instruction tuning. Extensive experiments reveal that even fine-tuned models achieve only about 60%~70% accuracy, underscoring the limitations of current MLLMs in understanding complex human language. We believe that MMLA will serve as a solid foundation for exploring the potential of large language models in multimodal language analysis and provide valuable resources to advance this field. The datasets and code are open-sourced at https://github.com/thuiar/MMLA.
中文: 本文提出MMLA基准,专门评估多模态大语言模型在六种语义维度上理解认知层面语义的能力,实验表明即使经过微调的模型准确率也仅达60%-70%,凸显出现有模型在理解复杂人类语言方面的局限。
English: This paper introduces MMLA, a comprehensive benchmark for evaluating multimodal large language models' ability to understand cognitive-level semantics across six dimensions, revealing current models' limitations with only 60%-70% accuracy despite extensive testing.

Authors:Jiahao Yuan, Xingzhe Sun, Xing Yu, Jingwen Wang, Dehui Du, Zhiqing Cui, Zixiang Di
Title: LLMSR@XLLM25: Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation
Abstract:
The LLMSR@XLLM25 shared task formulates a low-resource structural reasoning task that challenges LLMs to generate interpretable, step-by-step rationales with minimal labeled data. We present Less is More, the third-place winning approach in LLMSR@XLLM25, which focuses on structured reasoning from only 24 labeled examples. Our approach leverages a multi-agent framework with reverse-prompt induction, retrieval-augmented reasoning synthesis via GPT-4o, and dual-stage reward-guided filtering to distill high-quality supervision across three subtasks: question parsing, CoT parsing, and step-level verification. All modules are fine-tuned from Meta-Llama-3-8B-Instruct under a unified LoRA+ setup. By combining structure validation with reward filtering across few-shot and zero-shot prompts, our pipeline consistently improves structured reasoning quality. These results underscore the value of controllable data distillation in enhancing structured inference under low-resource constraints. Our code is available at https://github.com/JhCircle/Less-is-More.
中文摘要:在LLMSR@XLLM25竞赛中获得第三名的“Less is More”方法,通过多智能体框架结合逆向提示诱导和奖励引导过滤,仅用24个标注样本就提升了结构化推理能力,证明了可控数据蒸馏在低资源条件下的有效性。
English Summary: The "Less is More" approach, which won third place in the LLMSR@XLLM25 competition, employs a multi-agent framework with reverse-prompt induction and reward-guided filtering to enhance structured reasoning using only 24 labeled examples, demonstrating the effectiveness of controllable data distillation under low-resource constraints.

Authors:Yuxin Zuo, Kaiyan Zhang, Li Sheng, Shang Qu, Ganqu Cui, Xuekai Zhu, Haozhan Li, Yuchen Zhang, Xinwei Long, Ermo Hua, Biqing Qi, Youbang Sun, Zhiyuan Ma, Lifan Yuan, Ning Ding, Bowen Zhou
Title: TTRL: Test-Time Reinforcement Learning
Abstract:
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference while not having access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 211% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is only supervised by the maj@n metric, it consistently surpasses the upper limit of the initial model's maj@n and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight TTRL's potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
中文: 本文提出了一种名为测试时强化学习(TTRL)的新方法,通过利用多数投票等测试时扩展技术进行奖励估计,使大语言模型能够在无标注测试数据上借助强化学习实现自我进化。
English: This paper introduces Test-Time Reinforcement Learning (TTRL), a novel method that enables large language models to self-improve using reinforcement learning on unlabeled test data by leveraging test-time scaling techniques like majority voting for reward estimation.
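
The reward construction is simple to sketch: sample several answers per question, adopt the majority answer as a pseudo-label, and reward rollouts that agree with it. A minimal illustration, with stub answers in place of real model samples:

from collections import Counter

def majority_vote_rewards(sampled_answers: list[str]) -> tuple[str, list[float]]:
    """maj@n pseudo-label and per-rollout binary rewards, TTRL-style."""
    pseudo_label, _ = Counter(sampled_answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in sampled_answers]
    return pseudo_label, rewards

answers = ["42", "41", "42", "42", "7"]   # stub for n sampled model answers
label, rewards = majority_vote_rewards(answers)
print(label, rewards)                      # 42 [1.0, 0.0, 1.0, 1.0, 0.0]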

Authors:Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu
Title: Survey of Video Diffusion Models: Foundations, Implementations, and Applications
Abstract:
Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional generative adversarial networks-based approaches. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to the existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c) which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more updated, and more fine-grained perspective on diffusion-based approaches with a special section for evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field. A structured list of related works involved in this survey is also available at https://github.com/Eyeline-Research/Survey-Video-Diffusion.
中文: 扩散模型以卓越的时序一致性和视觉质量革新了视频生成领域,本综述系统性地梳理了其技术演进与方法体系,在剖析运动连贯性等挑战的同时,为研究者提供了涵盖评估指标与工程实践的完整资源库。
English: Diffusion models have revolutionized video generation with superior quality and temporal consistency, though challenges in motion coherence and efficiency persist, as this comprehensive survey systematically reviews their evolution, methodologies, and applications while offering an updated perspective on the field.

Authors:Zhifan Ye, Kejing Xia, Yonggan Fu, Xin Dong, Jihoon Hong, Xiangchi Yuan, Shizhe Diao, Jan Kautz, Pavlo Molchanov, Yingyan Celine Lin
Title: LongMamba: Enhancing Mamba's Long Context Capabilities via Training-Free Receptive Field Enlargement
Abstract:
State space models (SSMs) have emerged as an efficient alternative to Transformer models for language modeling, offering linear computational complexity and constant memory usage as context length increases. However, despite their efficiency in handling long contexts, recent studies have shown that SSMs, such as Mamba models, generally underperform compared to Transformers in long-context understanding tasks. To address this significant shortfall and achieve both efficient and accurate long-context understanding, we propose LongMamba, a training-free technique that significantly enhances the long-context capabilities of Mamba models. LongMamba builds on our discovery that the hidden channels in Mamba can be categorized into local and global channels based on their receptive field lengths, with global channels primarily responsible for long-context capability. These global channels can become the key bottleneck as the input context lengthens. Specifically, when input lengths largely exceed the training sequence length, global channels exhibit limitations in adaptively extending their receptive fields, leading to Mamba's poor long-context performance. The key idea of LongMamba is to mitigate the hidden state memory decay in these global channels by preventing the accumulation of unimportant tokens in their memory. This is achieved by first identifying critical tokens in the global channels and then applying token filtering to accumulate only those critical tokens. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for Mamba's long-context performance, significantly extending its operational range without requiring additional training. Our code is available at https://github.com/GATECH-EIC/LongMamba.
中文:LongMamba是一种无需训练的技术,通过识别并筛选全局通道中的关键令牌来缓解内存衰减,从而显著提升Mamba模型的长上下文理解能力,且无需额外训练。
English: LongMamba is a training-free technique that enhances Mamba models' long-context understanding by identifying and filtering critical tokens in global channels to mitigate memory decay, significantly improving performance without additional training.
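
The local/global channel split and token filtering can be caricatured in a few lines: channels whose per-step state decay implies an effective receptive field longer than the training length count as global, and such channels accumulate only the most important tokens. The decay model, threshold, and keep ratio below are our assumptions, not the released implementation.

import numpy as np

def effective_receptive_field(decay: float, eps: float = 0.01) -> float:
    """Steps until a memory scaled by `decay` per step falls below eps."""
    return np.log(eps) / np.log(decay)

def split_channels(decays: np.ndarray, train_len: int) -> np.ndarray:
    """Boolean mask of 'global' channels: receptive field beyond training length."""
    erf = np.array([effective_receptive_field(d) for d in decays])
    return erf > train_len

def filter_tokens(token_importance: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Indices of the critical tokens a global channel is allowed to accumulate."""
    k = max(1, int(len(token_importance) * keep_ratio))
    return np.sort(np.argsort(token_importance)[-k:])

decays = np.array([0.5, 0.9, 0.999, 0.99995])    # per-channel state decay rates
print(split_channels(decays, train_len=2048))    # [False False  True  True]
importance = np.random.default_rng(0).random(16)
print(filter_tokens(importance))                 # positions of kept tokens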

Authors:Luwei Xiao, Rui Mao, Shuai Zhao, Qika Lin, Yanhao Jia, Liang He, Erik Cambria
Title: Exploring Cognitive and Aesthetic Causality for Multimodal Aspect-Based Sentiment Analysis
Abstract:
Multimodal aspect-based sentiment classification (MASC) is an emerging task due to an increase in user-generated multimodal content on social platforms, aimed at predicting sentiment polarity toward specific aspect targets (i.e., entities or attributes explicitly mentioned in text-image pairs). Despite extensive efforts and significant achievements in existing MASC, substantial gaps remain in understanding fine-grained visual content and the cognitive rationales derived from semantic content and impressions (cognitive interpretations of emotions evoked by image content). In this study, we present Chimera: a cognitive and aesthetic sentiment causality understanding framework to derive fine-grained holistic features of aspects and infer the fundamental drivers of sentiment expression from both semantic perspectives and affective-cognitive resonance (the synergistic effect between emotional responses and cognitive interpretations). Specifically, this framework first incorporates visual patch features for patch-word alignment. Meanwhile, it extracts coarse-grained visual features (e.g., overall image representation) and fine-grained visual regions (e.g., aspect-related regions) and translates them into corresponding textual descriptions (e.g., facial, aesthetic). Finally, we leverage the sentimental causes and impressions generated by a large language model (LLM) to enhance the model's awareness of sentimental cues evoked by semantic content and affective-cognitive resonance. Experimental results on standard MASC datasets demonstrate the effectiveness of the proposed model, which also exhibits greater flexibility to MASC compared to LLMs such as GPT-4o. We have publicly released the complete implementation and dataset at https://github.com/Xillv/Chimera
Chinese: 该研究提出了Chimera框架,通过整合细粒度视觉特征和大语言模型的认知情感解释,提升了多模态方面级情感分类的性能,实验证明其效果优于GPT-4o等现有方法。
English: The study introduces Chimera, a framework that enhances multimodal aspect-based sentiment classification by integrating fine-grained visual features and cognitive-affective interpretations through large language models, demonstrating superior performance over existing methods like GPT-4o.

Authors:Daocheng Fu, Jianlong Chen, Renqiu Xia, Zijun Chen, Qi Liu, Yuan Feng, Hongbin Zhou, Renrui Zhang, Shiyang Feng, Peng Gao, Hongyuan Zha, Junchi Yan, Botian Shi, Yu Qiao, Bo Zhang
Title: TrustGeoGen: Formal-Verified Data Engine for Trustworthy Multi-modal Geometric Problem Solving
Abstract:
Mathematical geometric problem solving (GPS) demands verifiable logical coherence and multimodal reasoning capabilities. While large language models (LLMs) have shown rapid progress in GPS, their advancement is hindered by the lack of reliable benchmarks and systematic methodologies. A critical challenge is the inherent hallucination in LLMs, which leads to synthetic GPS datasets that are often noisy, unverified, and self-contradictory. To address this, we introduce TrustGeoGen, a data engine that generates formally verified geometric problems to establish a principled and trustworthy benchmark. Our engine integrates four key innovations: 1) Multimodal Alignment, which synchronizes the generation of diagrams, text, and step-by-step solutions; 2) Formal Verification, ensuring all reasoning paths are rule-compliant; 3) Connection Thinking, bridging formal deduction with human-like logical steps; and 4) our GeoExplore series algorithms, which produce diverse problem variants with multiple solutions and self-reflective backtracking. Using this engine, we create the GeoTrust-200K dataset and the corresponding GeoTrust-test benchmark, both with guaranteed cross-modal integrity. Experiments reveal that state-of-the-art models achieve only 45.83% accuracy on GeoTrust-test, highlighting its significant challenge. Furthermore, training on our synthesized data substantially improves model performance on GPS tasks, with strong generalization to out-of-domain (OOD) benchmarks. Our code and data are available at https://github.com/Alpha-Innovator/TrustGeoGen
中文: TrustGeoGen 是一个生成经过形式化验证的几何问题的数据引擎,通过创建GeoTrust-200K数据集和GeoTrust-test基准,有效应对大语言模型的幻觉问题,显著提升了模型在几何问题解决中的表现和泛化能力。
English: TrustGeoGen is a data engine that creates formally verified geometric problems to address LLM hallucinations, producing the GeoTrust-200K dataset and GeoTrust-test benchmark, which significantly challenge existing models and enhance their performance and generalization in geometric problem solving.

Authors:Anjiang Wei, Huanmi Tan, Tarun Suresh, Daniel Mendoza, Thiago S. F. X. Teixeira, Ke Wang, Caroline Trippel, Alex Aiken
Title: VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation
Abstract:
Recent advances in Large Language Models (LLMs) have sparked growing interest in applying them to Electronic Design Automation (EDA) tasks, particularly Register Transfer Level (RTL) code generation. While several RTL datasets have been introduced, most focus on syntactic validity rather than functional validation with tests, leading to training examples that compile but may not implement the intended behavior. We present VERICODER, a model for RTL code generation fine-tuned on a dataset validated for functional correctness. This fine-tuning dataset is constructed using a novel methodology that combines unit test generation with feedback-directed refinement. Given a natural language specification and an initial RTL design, we prompt a teacher model (GPT-4o-mini) to generate unit tests and iteratively revise the RTL design based on its simulation results using the generated tests. If necessary, the teacher model also updates the tests to ensure they comply with the natural language specification. As a result of this process, every example in our dataset is functionally validated, consisting of a natural language description, an RTL implementation, and passing tests. Fine-tuned on this dataset of 125,777 examples, VERICODER achieves state-of-the-art metrics in functional correctness on VerilogEval and RTLLM, with relative gains of up to 71.7% and 27.4%, respectively. An ablation study further shows that models trained on our functionally validated dataset outperform those trained on functionally non-validated datasets, underscoring the importance of high-quality datasets in RTL code generation. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/VeriCoder
Chinese: VERICODER是一个基于功能验证数据集微调的RTL代码生成模型,通过结合单元测试生成和反馈导向优化的新方法,在功能正确性方面达到了最先进的性能。
English: VERICODER is a model fine-tuned on a functionally validated RTL code generation dataset, achieving state-of-the-art performance in functional correctness through a novel methodology combining unit test generation and feedback-directed refinement.
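
The dataset-construction loop reads as: generate unit tests, simulate, revise from failures, repeat. The skeleton below captures only that control flow; generate_tests, simulate, and revise_rtl are hypothetical stubs standing in for teacher-model calls and an RTL simulator.

def feedback_directed_refinement(spec: str, rtl: str,
                                 generate_tests, simulate, revise_rtl,
                                 max_rounds: int = 4):
    """Iteratively refine an RTL design until it passes generated unit tests.
    Returns (rtl, tests) on success, None if the budget is exhausted."""
    tests = generate_tests(spec, rtl)             # teacher model writes unit tests
    for _ in range(max_rounds):
        report = simulate(rtl, tests)             # run tests in an RTL simulator
        if report.all_passed:
            return rtl, tests                     # functionally validated example
        rtl = revise_rtl(spec, rtl, report)       # teacher revises from failures
    return None

class Report:                                     # minimal stand-in result type
    def __init__(self, all_passed): self.all_passed = all_passed

demo = feedback_directed_refinement(
    spec="2-to-1 mux", rtl="module bad;",
    generate_tests=lambda s, r: ["t1"],
    simulate=lambda r, t: Report(all_passed="fixed" in r),
    revise_rtl=lambda s, r, rep: r + " // fixed",
)
print(demo)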

Authors:Yuxin Jiang, Yufei Wang, Chuhan Wu, Xinyi Dai, Yan Xu, Weinan Gan, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, Wei Wang
Title: Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
Abstract:
The improvement of LLMs' instruction-following capabilities depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthetic methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm--Web as Instruction and Web as Response--where each web document is designated as either an instruction or a response to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort. The data and code are publicly available at https://github.com/YJiangcm/WebR.
Chinese: WebR 是一种自动化框架,通过“网页作为指令”和“网页作为响应”的双重视角范式,直接从原始网页文档中合成高质量的指令微调数据,在四项基准测试中性能最高提升16.65%,并展现出卓越的兼容性和可扩展性。
English: WebR is an automated framework that synthesizes high-quality instruction-response pairs from raw web documents through a dual-perspective paradigm, significantly outperforming existing methods by up to 16.65% across benchmarks while demonstrating superior compatibility and scalability.
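
The dual-perspective paradigm amounts to two prompt templates over the same raw document, one casting it as an instruction and one as a response; the prompt wording below is ours, not the paper's.

def web_as_instruction(doc: str) -> str:
    # treat the web text as a (noisy) instruction; ask a model to write the response
    return (f"The following web text expresses a user request.\n"
            f"Rewrite it as a clear instruction, then answer it.\n\n{doc}")

def web_as_response(doc: str) -> str:
    # treat the web text as a latent response; ask a model to recover the instruction
    return (f"The following web text is the answer to some instruction.\n"
            f"Write the instruction it best answers, then refine the text "
            f"into a high-quality response.\n\n{doc}")

doc = "Symptoms of dehydration include headache, dizziness, and dark urine."
print(web_as_response(doc))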

Authors:Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Title: CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
Abstract:
Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information of occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion as well as difficulty in counting in images. Code and data: https://github.com/atinpothiraj/CAPTURe
中文摘要:CAPTURe任务通过评估视觉语言模型对遮挡物后方图案化物体的计数能力,发现即使先进模型也难以进行空间推理,而人类表现近乎完美。
English Summary: The CAPTURe task evaluates vision-language models' ability to count patterned objects behind occlusions, revealing that even advanced models struggle with spatial reasoning about hidden objects while humans perform nearly flawlessly.

Authors:Jiayi Pan, Xiuyu Li, Long Lian, Charlie Snell, Yifei Zhou, Adam Yala, Trevor Darrell, Kurt Keutzer, Alane Suhr
Title: Learning Adaptive Parallel Reasoning with Language Models
Abstract:
Scaling inference-time computation has substantially improved the reasoning capabilities of language models. However, existing methods have significant limitations: serialized chain-of-thought approaches generate overly long outputs, leading to increased latency and exhausted context windows, while parallel methods such as self-consistency suffer from insufficient coordination, resulting in redundant computations and limited performance gains. To address these shortcomings, we propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures. Experiments on the Countdown reasoning task demonstrate significant benefits of APR: (1) higher performance within the same context window (83.4% vs. 60.0% at 4k context); (2) superior scalability with increased computation (80.1% vs. 66.6% at 20k total tokens); (3) improved accuracy at equivalent latency (75.2% vs. 57.3% at approximately 5,000ms). APR represents a step towards enabling language models to autonomously optimize their reasoning processes through adaptive allocation of computation.
中文:自适应并行推理(APR)框架通过强化学习让语言模型动态协调串行与并行计算,相比现有方法在性能、可扩展性和准确性方面均实现显著提升。
English: The proposed Adaptive Parallel Reasoning (APR) framework enables language models to dynamically orchestrate serial and parallel computations through reinforcement learning, achieving superior performance, scalability, and accuracy compared to existing methods.
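
The spawn()/join() control flow can be mimicked with an ordinary thread pool: a parent decomposes the problem, spawns child inference calls, and joins their results before continuing serially. The stub below is our rendering of that flow, not the authors' code (which learns when to spawn via reinforcement learning).

from concurrent.futures import ThreadPoolExecutor

def child_reason(subquery: str) -> str:
    # stub for a child inference thread exploring one branch
    return f"partial({subquery})"

def parent_reason(problem: str) -> str:
    subqueries = [f"{problem}/branch{i}" for i in range(3)]   # model-chosen decomposition
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(child_reason, q) for q in subqueries]  # spawn()
        partials = [f.result() for f in futures]                      # join()
    return f"synthesize({partials})"                                  # parent continues serially

print(parent_reason("countdown(24; 3,5,7,9)"))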

Authors:David Ma, Yuanxing Zhang, Jincheng Ren, Jarvis Guo, Yifan Yao, Zhenlin Wei, Zhenzhu Yang, Zhongyuan Peng, Boyu Feng, Jun Ma, Xiao Gu, Zhoufutu Wen, King Zhu, Yancheng He, Meng Cao, Shiwen Ni, Jiaheng Liu, Wenhao Huang, Ge Zhang, Xiaojie Jin
Title: IV-Bench: A Benchmark for Image-Grounded Video Perception and Reasoning in Multimodal LLMs
Abstract:
Existing evaluation frameworks for Multimodal Large Language Models (MLLMs) primarily focus on image reasoning or general video understanding tasks, largely overlooking the significant role of image context in video comprehension. To bridge this gap, we propose IV-Bench, the first comprehensive benchmark for evaluating Image-Grounded Video Perception and Reasoning. IV-Bench consists of 967 videos paired with 2,585 meticulously annotated image-text queries across 13 tasks (7 perception and 6 reasoning tasks) and 5 representative categories. Extensive evaluations of state-of-the-art open-source (e.g., InternVL2.5, Qwen2.5-VL) and closed-source (e.g., GPT-4o, Gemini2-Flash and Gemini2-Pro) MLLMs demonstrate that current models substantially underperform in image-grounded video perception and reasoning, merely achieving at most 28.9% accuracy. Further analysis reveals key factors influencing model performance on IV-Bench, including inference pattern, frame number, and resolution. Additionally, through a simple data synthesis approach, we demonstrate that the challenges of IV-Bench extend beyond merely aligning the data format in the training process. These findings collectively provide valuable insights for future research. Our code and data are released at https://github.com/multimodal-art-projection/IV-Bench.
中文: 现有的多模态大模型评估主要关注图像推理或通用视频理解任务,忽视了图像上下文在视频理解中的重要作用,因此提出了首个全面的图像基础视频感知与推理基准IV-Bench,发现当前先进模型在此任务上表现严重不足,最高准确率仅达28.9%,并揭示了影响性能的关键因素。
English: Current multimodal models are evaluated primarily on image reasoning or general video tasks, neglecting the role of image context in video understanding, so IV-Bench is introduced as the first comprehensive benchmark for image-grounded video perception and reasoning, revealing that state-of-the-art models significantly underperform with at most 28.9% accuracy and highlighting key influencing factors.

Authors:Zhiqiu Lin, Siyuan Cen, Daniel Jiang, Jay Karhade, Hewei Wang, Chancharik Mitra, Tiffany Ling, Yuhan Huang, Sifan Liu, Mingyu Chen, Rushikesh Zawar, Xue Bai, Yilun Du, Chuang Gan, Deva Ramanan
Title: Towards Understanding Camera Motions in Any Video
Abstract:
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
中文摘要:CameraBench是一个用于评估和改进摄像机运动理解的大规模数据集和基准,包含专家标注的视频和运动基元分类法,揭示了现有模型的局限性,并通过微调实现了性能提升。
English Summary: CameraBench is a comprehensive dataset and benchmark for evaluating camera motion understanding, featuring expert-annotated videos and a taxonomy of motion primitives that reveals the limitations of current models and enables improved performance through fine-tuning.

Authors:Yuan-Hong Liao, Sven Elflein, Liu He, Laura Leal-Taixé, Yejin Choi, Sanja Fidler, David Acuna
Title: LongPerceptualThoughts: Distilling System-2 Reasoning for System-1 Perception
Abstract:
Recent reasoning models through test-time scaling have demonstrated that long chain-of-thoughts can unlock substantial performance boosts in hard reasoning tasks such as math and code. However, the benefit of such long thoughts for system-2 reasoning is relatively less explored in other domains such as perceptual tasks where shallower, system-1 reasoning seems sufficient. In this paper, we introduce LongPerceptualThoughts, a new synthetic dataset with 30K long-thought traces for perceptual tasks. The key challenges in synthesizing elaborate reasoning thoughts for perceptual tasks are that off-the-shelf models are not yet equipped with such thinking behavior and that it is not straightforward to build a reliable process verifier for perceptual tasks. Thus, we propose a novel three-stage data synthesis framework that first synthesizes verifiable multiple-choice questions from dense image descriptions, then extracts simple CoTs from VLMs for those verifiable problems, and finally expands those simple thoughts to elaborate long thoughts via frontier reasoning models. In controlled experiments with a strong instruction-tuned 7B model, we demonstrate notable improvements over existing visual reasoning data-generation methods. Our model, trained on the generated dataset, achieves an average +3.4 points improvement over 5 vision-centric benchmarks, including +11.8 points on V* Bench. Notably, despite being tuned for vision tasks, it also improves performance on the text reasoning benchmark, MMLU-Pro, by +2 points.
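中文: LongPerceptualThoughts 是一个包含3万条长思维链的感知任务合成数据集,通过三阶段框架(从密集图像描述合成可验证选择题、从视觉语言模型提取简单思维链、再借助前沿推理模型扩展为详尽长思维)构建,使7B模型在五个视觉基准上平均提升3.4分。
English: LongPerceptualThoughts is a 30K-trace synthetic dataset for perceptual tasks, built with a three-stage framework that synthesizes verifiable questions from dense image descriptions, extracts simple CoTs from VLMs, and expands them into elaborate long thoughts, yielding an average +3.4-point gain for a 7B model across five vision-centric benchmarks.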

Authors:Chun-Hsiao Yeh, Chenyu Wang, Shengbang Tong, Ta-Ying Cheng, Ruoyu Wang, Tianzhe Chu, Yuexiang Zhai, Yubei Chen, Shenghua Gao, Yi Ma
Title: Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs
Abstract:
Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) to be used as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we propose All-Angles Bench, a benchmark of over 2,100 carefully human-annotated multi-view question-answer pairs across 90 diverse real-world scenes. Our six tasks (counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation) specifically test a model's geometric correspondence and its capacity to align information consistently across views. Our extensive experiments, benchmarking 27 representative MLLMs including Gemini-2.0-Flash, Claude-3.7-Sonnet, and GPT-4o against human evaluators, reveal a substantial performance gap, indicating that current MLLMs remain far from human-level proficiency. Through in-depth analysis, we show that MLLMs are particularly underperforming under two aspects: (1) cross-view correspondence for partially occluded views and (2) establishing the coarse camera poses. These findings highlight the necessity of domain-specific refinements or modules that embed stronger multi-view awareness. We believe that our All-Angles Bench offers valuable insights and contributes to bridging the gap between MLLMs and human-level multi-view understanding. The project and benchmark are publicly available at https://danielchyeh.github.io/All-Angles-Bench/.
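中文: All-Angles Bench 提供90个真实场景中2100余个人工标注的多视角问答对,对27个代表性多模态大模型的评测揭示其与人类水平差距显著,尤其在部分遮挡视角的跨视图对应与粗略相机位姿估计方面表现不足。
English: All-Angles Bench offers over 2,100 human-annotated multi-view question-answer pairs across 90 real-world scenes, and its evaluation of 27 representative MLLMs reveals a substantial gap from human-level proficiency, particularly in cross-view correspondence under partial occlusion and coarse camera pose estimation.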

Authors:Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
Title: Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
Abstract:
We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of present-day language models. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue that next-token learning is myopic; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed seed-conditioning) works surprisingly as well as (and in some conditions, better than) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available at https://github.com/chenwu98/algorithmic-creativity
中文摘要:本研究设计了一套最小算法任务来评估语言模型的创造性局限,发现多标记方法在生成多样性输出上优于单标记学习,且输入层噪声注入在平衡随机性与连贯性方面不亚于温度采样。
English Summary: This study introduces minimal algorithmic tasks to evaluate the creative limitations of language models, demonstrating that multi-token approaches outperform next-token learning in generating diverse outputs and that input-layer noise injection rivals temperature sampling for balancing randomness and coherence.
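
Seed-conditioning is easy to express: fix a random seed, inject Gaussian noise at the input-embedding layer, and decode greedily, so output diversity comes from the input side rather than output-layer temperature. A toy sketch, with the noise scale and injection point as our assumptions:

import torch

def seed_conditioned_embed(token_ids: torch.Tensor,
                           embedding: torch.nn.Embedding,
                           seed: int, sigma: float = 0.1) -> torch.Tensor:
    """Add seeded Gaussian noise to input embeddings; decoding can then be greedy."""
    gen = torch.Generator().manual_seed(seed)
    embs = embedding(token_ids)
    noise = torch.randn(embs.shape, generator=gen) * sigma
    return embs + noise

emb = torch.nn.Embedding(100, 32)
ids = torch.tensor([[1, 5, 7]])
a = seed_conditioned_embed(ids, emb, seed=0)
b = seed_conditioned_embed(ids, emb, seed=1)   # different seed -> different continuation
print((a - b).abs().mean().item() > 0)         # True: seeds drive the variation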

Authors:Anirudh Khatry, Robert Zhang, Jia Pan, Ziteng Wang, Qiaochu Chen, Greg Durrett, Isil Dillig
Title: CRUST-Bench: A Comprehensive Benchmark for C-to-safe-Rust Transpilation
Abstract:
C-to-Rust transpilation is essential for modernizing legacy C code while enhancing safety and interoperability with modern Rust ecosystems. However, no dataset currently exists for evaluating whether a system can transpile C into safe Rust that passes a set of test cases. We introduce CRUST-Bench, a dataset of 100 C repositories, each paired with manually-written interfaces in safe Rust as well as test cases that can be used to validate correctness of the transpilation. By considering entire repositories rather than isolated functions, CRUST-Bench captures the challenges of translating complex projects with dependencies across multiple files. The provided Rust interfaces provide explicit specifications that ensure adherence to idiomatic, memory-safe Rust patterns, while the accompanying test cases enforce functional correctness. We evaluate state-of-the-art large language models (LLMs) on this task and find that safe and idiomatic Rust generation is still a challenging problem for various state-of-the-art methods and techniques. We also provide insights into the errors LLMs usually make in transpiling code from C to safe Rust. The best performing model, OpenAI o1, is able to solve only 15 tasks in a single-shot setting. Improvements on CRUST-Bench would lead to improved transpilation systems that can reason about complex scenarios and help in migrating legacy codebases from C into languages like Rust that ensure memory safety. You can find the dataset and code at https://github.com/anirudhkhatry/CRUST-bench.
Chinese: CRUST-Bench是一个包含100个C语言仓库及对应安全Rust接口与测试用例的数据集,用于评估C到Rust的转译系统,结果表明现有方法在生成安全地道的Rust代码方面仍面临挑战,最优模型仅能完成15项任务。
English: CRUST-Bench is a dataset of 100 C repositories with safe Rust interfaces and test cases, designed to evaluate C-to-Rust transpilation systems, revealing that current methods struggle with generating safe, idiomatic Rust code as even the best model solved only 15 tasks.
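
A plausible shape for the validation loop is sketched below; the crate layout and paths are hypothetical rather than the benchmark's released runner, but the contract (fill in the provided interface, let the fixed tests arbitrate) follows the abstract.

```python
# Hypothetical validation loop for a CRUST-Bench-style task: drop the
# model's Rust into a crate that already contains the reference interface
# and test cases, then let `cargo test` decide functional correctness.
import pathlib
import subprocess

def passes_tests(candidate_rust: str, crate_dir: str) -> bool:
    src = pathlib.Path(crate_dir) / "src" / "lib.rs"
    src.write_text(candidate_rust)           # candidate must match the interface
    result = subprocess.run(
        ["cargo", "test", "--quiet"],
        cwd=crate_dir, capture_output=True, text=True, timeout=300,
    )
    return result.returncode == 0            # all tests green = correct transpile
```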

Authors:Yilun Zhou, Austin Xu, Peifeng Wang, Caiming Xiong, Shafiq Joty
Title: Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Abstract:
Scaling test-time computation, or affording a generator large language model (LLM) extra compute during inference, typically employs the help of external non-generative evaluators (i.e., reward models). Concurrently, LLM-judges, models trained to generate evaluations and critiques (explanations) in natural language, are becoming increasingly popular in automatic evaluation. Despite judges' empirical successes, their effectiveness as evaluators in test-time scaling settings is largely unknown. In this paper, we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement. We evaluate 10 different judge models (7B-70B parameters) for 8 different base generator models (6.7B-72B parameters). Our benchmark shows that while judges are competitive with outcome reward models in reranking, they are consistently worse than process reward models in beam search procedures. Furthermore, though unique to LLM-judges, their natural language critiques are currently ineffective in guiding the generator towards better responses.
中文: JETTS基准测试表明,尽管LLM评判模型在重排序任务中与结果奖励模型表现相当,但在束搜索中不如过程奖励模型,且其自然语言评析目前无法有效指导生成模型改进回答。
English: The JETTS benchmark reveals that while LLM-judges are competitive with outcome reward models in reranking tasks, they underperform process reward models in beam search and their natural language critiques currently fail to effectively guide generators toward improved responses.
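
The reranking setting is easy to state in code. A minimal sketch, where judge_score is a hypothetical stand-in for prompting an LLM-judge to rate one (question, response) pair:

```python
# Judge-based best-of-n reranking: score each candidate independently with
# the judge and keep its favorite. JETTS compares exactly this kind of
# selection against outcome and process reward models.
from typing import Callable

def rerank(question: str, candidates: list[str],
           judge_score: Callable[[str, str], float]) -> str:
    return max(candidates, key=lambda resp: judge_score(question, resp))
```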

Authors:Juyeon Kim, Geon Lee, Taeuk Kim, Kijung Shin
Title: KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking
Abstract:
Entity linking (EL) aligns textual mentions with their corresponding entities in a knowledge base, facilitating various applications such as semantic search and question answering. Recent advances in multimodal entity linking (MEL) have shown that combining text and images can reduce ambiguity and improve alignment accuracy. However, most existing MEL methods overlook the rich structural information available in the form of knowledge-graph (KG) triples. In this paper, we propose KGMEL, a novel framework that leverages KG triples to enhance MEL. Specifically, it operates in three stages: (1) Generation: Produces high-quality triples for each mention by employing vision-language models based on its text and images. (2) Retrieval: Learns joint mention-entity representations, via contrastive learning, that integrate text, images, and (generated or KG) triples to retrieve candidate entities for each mention. (3) Reranking: Refines the KG triples of the candidate entities and employs large language models to identify the best-matching entity for the mention. Extensive experiments on benchmark datasets demonstrate that KGMEL outperforms existing methods. Our code and datasets are available at: https://github.com/juyeonnn/KGMEL.
中文: 本文提出KGMEL框架,通过生成、检索和重排三阶段整合知识图谱三元组来提升多模态实体链接的准确性,实验证明其优于现有方法。
English: The paper introduces KGMEL, a multimodal entity linking framework that enhances accuracy by incorporating knowledge-graph triples through generation, retrieval, and reranking stages, outperforming existing methods in experiments.
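
The retrieval stage rests on a standard in-batch contrastive objective; a minimal InfoNCE sketch, with the text/image/triple encoders assumed rather than shown:

```python
# In-batch InfoNCE of the kind KGMEL's retrieval stage describes: align
# joint mention embeddings (text + image + triples) with entity embeddings.
import torch
import torch.nn.functional as F

def info_nce(mention_emb, entity_emb, temperature=0.07):
    m = F.normalize(mention_emb, dim=-1)        # (B, d)
    e = F.normalize(entity_emb, dim=-1)         # (B, d), row i matches row i
    logits = m @ e.t() / temperature            # (B, B) similarity matrix
    labels = torch.arange(m.size(0), device=m.device)
    return F.cross_entropy(logits, labels)      # off-diagonal = in-batch negatives
```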

Authors:Ziwen Xu, Shuxun Wang, Kewei Xu, Haoming Xu, Mengru Wang, Xinle Deng, Yunzhi Yao, Guozhou Zheng, Huajun Chen, Ningyu Zhang
Title: EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
Abstract:
In this paper, we introduce EasyEdit2, a framework designed to enable plug-and-play adjustability for controlling Large Language Model (LLM) behaviors. EasyEdit2 supports a wide range of test-time interventions, including safety, sentiment, personality, reasoning patterns, factuality, and language features. Unlike its predecessor, EasyEdit2 features a new architecture specifically designed for seamless model steering. It comprises key modules such as the steering vector generator and the steering vector applier, which enable automatic generation and application of steering vectors to influence the model's behavior without modifying its parameters. One of the main advantages of EasyEdit2 is its ease of use: users do not need extensive technical knowledge. With just a single example, they can effectively guide and adjust the model's responses, making precise control both accessible and efficient. Empirically, we report model steering performance across different LLMs, demonstrating the effectiveness of these techniques. We have released the source code on GitHub at https://github.com/zjunlp/EasyEdit along with a demonstration notebook. In addition, we provide a demo video at https://www.youtube.com/watch?v=AkfoiPfp5rQ for a quick introduction.
中文:EasyEdit2是一个即插即用的框架,通过测试时干预让用户无需深厚技术知识即可轻松控制大型语言模型的行为,仅需一个示例即可精确调整模型响应。
English: EasyEdit2 is a plug-and-play framework that enables users to easily control Large Language Model behaviors through test-time interventions, requiring minimal technical knowledge and allowing precise adjustments with just a single example.
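
For intuition, here is a generic activation-steering sketch (not EasyEdit2's actual API): derive a steering vector from one contrastive pair of prompts, then add it to a hidden layer at inference via a forward hook. The layer index, scale, and module path are illustrative assumptions.

```python
# Generic activation steering: vector = activation(desired) - activation(undesired),
# computed from a single contrastive example, then injected with a hook.
import torch

@torch.no_grad()
def hidden_at_layer(model, tok, text, layer):
    ids = tok(text, return_tensors="pt").input_ids
    hs = model(ids, output_hidden_states=True).hidden_states
    return hs[layer][0, -1]                     # last-token activation

def make_steering_hook(vector, alpha=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vector        # shift every position
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Illustrative usage for a GPT-2-style model (module path is an assumption):
# vec = hidden_at_layer(model, tok, positive_text, L) \
#     - hidden_at_layer(model, tok, negative_text, L)
# handle = model.transformer.h[L].register_forward_hook(make_steering_hook(vec))
```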

Authors:Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Title: RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search
Abstract:
Large Language Models (LLMs) exhibit remarkable capabilities but are susceptible to adversarial prompts that exploit vulnerabilities to produce unsafe or biased outputs. Existing red-teaming methods often face scalability challenges, resource-intensive requirements, or limited diversity in attack strategies. We propose RainbowPlus, a novel red-teaming framework rooted in evolutionary computation, enhancing adversarial prompt generation through an adaptive quality-diversity (QD) search that extends classical evolutionary algorithms like MAP-Elites with innovations tailored for language models. By employing a multi-element archive to store diverse high-quality prompts and a comprehensive fitness function to evaluate multiple prompts concurrently, RainbowPlus overcomes the constraints of single-prompt archives and pairwise comparisons in prior QD methods like Rainbow Teaming. Experiments comparing RainbowPlus to QD methods across six benchmark datasets and four open-source LLMs demonstrate superior attack success rate (ASR) and diversity (Diverse-Score $\approx 0.84$), generating up to 100 times more unique prompts (e.g., 10,418 vs. 100 for Ministral-8B-Instruct-2410). Against nine state-of-the-art methods on the HarmBench dataset with twelve LLMs (ten open-source, two closed-source), RainbowPlus achieves an average ASR of 81.1%, surpassing AutoDAN-Turbo by 3.9%, and is 9 times faster (1.45 vs. 13.50 hours). Our open-source implementation fosters further advancements in LLM safety, offering a scalable tool for vulnerability assessment. Code and resources are publicly available at https://github.com/knoveleng/rainbowplus, supporting reproducibility and future research in LLM red-teaming.
Chinese: RainbowPlus是一种基于进化计算的新型红队框架,通过自适应质量多样性搜索生成多样化的对抗性提示,在多个大语言模型和数据集上显著超越了现有方法的攻击成功率与效率。
English: RainbowPlus is an innovative red-teaming framework using evolutionary computation to generate diverse and effective adversarial prompts, significantly outperforming existing methods in attack success rate and efficiency across multiple LLMs and datasets.
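
The multi-element archive at the core of the method can be sketched in a few lines; mutate, fitness, and descriptor are stand-ins for the LLM-backed operators the paper uses, and the loop structure is our simplification.

```python
# MAP-Elites-style quality-diversity search with a multi-element archive:
# each behavior cell keeps its top-k prompts instead of a single elite.
import random
from collections import defaultdict

def qd_search(seed_prompts, descriptor, mutate, fitness, iters=1000, per_cell=5):
    archive = defaultdict(list)                  # cell -> [(score, prompt)]
    for p in seed_prompts:
        archive[descriptor(p)].append((fitness(p), p))
    for _ in range(iters):
        cell = random.choice(list(archive))
        parent = random.choice(archive[cell])[1]
        child = mutate(parent)                   # LLM-driven variation
        bucket = archive[descriptor(child)]
        bucket.append((fitness(child), child))
        bucket.sort(key=lambda t: t[0], reverse=True)
        del bucket[per_cell:]                    # keep only the top-k per cell
    return archive
```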

Authors:Yingming Zheng, Xiaoliang Liu, Peng Wu, Li Pan
Title: CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs
Abstract:
The rapid spread of misinformation, driven by digital media and AI-generated content, has made automatic claim verification essential. Traditional methods, which depend on expert-annotated evidence, are labor-intensive and not scalable. Although recent automated systems have improved, they still struggle with complex claims that require nuanced reasoning. To address this, we propose CRAVE, a Conflicting Reasoning Approach for explainable claim VErification, which verifies complex claims based on conflicting rationales reasoned by large language models (LLMs). Specifically, CRAVE introduces a three-module framework. The Ambiguity Elimination enhanced Evidence Retrieval module performs ambiguity elimination and entity-based search to gather evidence relevant to claim verification from external sources like Wikipedia. The Conflicting Perspective Reasoning and Preliminary Judgment module uses LLMs to reason rationales with conflicting stances about claim verification from the retrieved evidence across four dimensions, i.e., direct evidence, semantic relationships, linguistic patterns, and logical reasoning, and makes a preliminary judgment. Finally, the Small Language Model (SLM) based Judge module is fine-tuned to use the preliminary judgment from the LLMs to assess the confidence of the conflicting rationales and make a final authenticity judgment. This methodology allows CRAVE to capture subtle inconsistencies in complex claims, improving both the accuracy and transparency of claim verification. Extensive experiments on two public claim verification datasets demonstrate that our CRAVE model achieves much better performance than state-of-the-art methods and exhibits a superior capacity for finding relevant evidence and explaining the model predictions. The code is provided at https://github.com/8zym/CRAVE.
中文摘要:针对传统声明验证方法的局限性,我们提出CRAVE框架,利用大语言模型生成对立推理依据,并通过微调的小语言模型进行最终判定,显著提升了复杂声明验证的准确性与可解释性。
English Summary: To address the limitations of traditional claim verification methods, we propose CRAVE, a framework that leverages large language models to generate conflicting rationales and uses a fine-tuned small language model for final judgment, significantly improving accuracy and transparency in verifying complex claims.

Authors:Yiting Ran, Xintao Wang, Tian Qiu, Jiaqing Liang, Yanghua Xiao, Deqing Yang
Title: BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation
Abstract:
Recent advances in large language models (LLMs) have enabled social simulation through multi-agent systems. Prior efforts focus on agent societies created from scratch, assigning agents with newly defined personas. However, simulating established fictional worlds and characters remains largely underexplored, despite its significant practical value. In this paper, we introduce BookWorld, a comprehensive system for constructing and simulating book-based multi-agent societies. BookWorld's design covers comprehensive real-world intricacies, including diverse and dynamic characters, fictional worldviews, geographical constraints and changes, etc. BookWorld enables diverse applications including story generation, interactive games and social simulation, offering novel ways to extend and explore beloved fictional works. Through extensive experiments, we demonstrate that BookWorld generates creative, high-quality stories while maintaining fidelity to the source books, surpassing previous methods with a win rate of 75.36%. The code of this paper can be found at the project page: https://bookworld2025.github.io/.
中文: 本文提出BookWorld系统,通过构建基于书籍的多智能体社会来模拟复杂虚构世界,在保持原著忠实度的同时实现故事生成等应用,并以75.36%的胜率超越现有方法。
English: This paper introduces BookWorld, a system for simulating book-based multi-agent societies that captures intricate fictional elements and enables applications like story generation, outperforming prior methods with a 75.36% win rate while maintaining fidelity to source materials.

Authors:Tong Zeng, Longfeng Wu, Liang Shi, Dawei Zhou, Feng Guo
Title: Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding
Abstract:
Vision Large Language Models (VLLMs) have demonstrated impressive capabilities in general visual tasks such as image captioning and visual question answering. However, their effectiveness in specialized, safety-critical domains like autonomous driving remains largely unexplored. Autonomous driving systems require sophisticated scene understanding in complex environments, yet existing multimodal benchmarks primarily focus on normal driving conditions, failing to adequately assess VLLMs' performance in safety-critical scenarios. To address this, we introduce DVBench, a pioneering benchmark designed to evaluate the performance of VLLMs in understanding safety-critical driving videos. Built around a hierarchical ability taxonomy that aligns with widely adopted frameworks for describing driving scenarios used in assessing highly automated driving systems, DVBench features 10,000 multiple-choice questions with human-annotated ground-truth answers, enabling a comprehensive evaluation of VLLMs' capabilities in perception and reasoning. Experiments on 14 SOTA VLLMs, ranging from 0.5B to 72B parameters, reveal significant performance gaps, with no model achieving over 40% accuracy, highlighting critical limitations in understanding complex driving scenarios. To probe adaptability, we fine-tuned selected models using domain-specific data from DVBench, achieving accuracy gains ranging from 5.24 to 10.94 percentage points, with relative improvements of up to 43.59%. This improvement underscores the necessity of targeted adaptation to bridge the gap between general-purpose VLLMs and mission-critical driving applications. DVBench establishes an essential evaluation framework and research roadmap for developing VLLMs that meet the safety and robustness requirements for real-world autonomous systems. We released the benchmark toolbox and the fine-tuned model at: https://github.com/tong-zeng/DVBench.git.
Chinese Summary: 视觉大语言模型在通用视觉任务中表现出色,但在自动驾驶等安全关键领域的应用仍存局限;新推出的DVBench基准测试揭示了现有模型在复杂驾驶场景理解上的不足,并通过领域微调显著提升了模型性能,为开发符合实际安全要求的视觉大语言模型提供了重要评估框架。
English Summary: Vision Large Language Models (VLLMs) show strong performance in general visual tasks but struggle with safety-critical autonomous driving scenarios, as demonstrated by the new DVBench benchmark which revealed significant performance gaps and the need for domain-specific fine-tuning to improve their applicability.

Authors:Xiang Li, Duyi Pan, Hongru Xiao, Jiale Han, Jing Tang, Jiabao Ma, Wei Wang, Bo Cheng
Title: DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue
Abstract:
Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity, contextual scenarios, and emotional expressiveness. To address these issues, we propose DialogueAgents, a novel hybrid agent-based speech synthesis framework, which integrates three specialized agents -- a script writer, a speech synthesizer, and a dialogue critic -- to collaboratively generate dialogues. Grounded in a diverse character pool, the framework iteratively refines dialogue scripts and synthesizes speech based on speech review, boosting emotional expressiveness and paralinguistic features of the synthesized dialogues. Using DialogueAgents, we contribute MultiTalk, a bilingual, multi-party, multi-turn speech dialogue dataset covering diverse topics. Extensive experiments demonstrate the effectiveness of our framework and the high quality of the MultiTalk dataset. We release the dataset and code at https://github.com/uirlx/DialogueAgents to facilitate future research on advanced speech synthesis models and customized data generation.
中文:DialogueAgents框架通过三个专业代理协同生成富有表现力的多样化语音对话,创建了高质量的MultiTalk数据集,有效解决了现有数据集成本高、多样性不足的问题。
English: The DialogueAgents framework uses three specialized agents to collaboratively generate expressive, diverse speech dialogues, producing the high-quality MultiTalk dataset and addressing limitations of costly and limited existing datasets.

Authors:Liu Xiao, Li Zhiyuan, Lin Yueyu
Title: Cross-attention for State-based model RWKV-7
Abstract:
We introduce CrossWKV, a novel cross-attention mechanism for the state-based RWKV-7 model, designed to enhance the expressive power of text-to-image generation. Leveraging RWKV-7's linear-complexity Weighted Key-Value (WKV) architecture, CrossWKV integrates text and image modalities in a single pass, utilizing a generalized delta rule with vector-valued gating and low-rank adaptations (LoRA) to achieve superior cross-modal alignment. Unlike Transformer-based models, CrossWKV's non-diagonal, input-dependent transition matrix enables it to represent complex functions beyond the $\mathrm{TC}^0$ complexity class, including all regular languages, as demonstrated by its ability to perform state-tracking tasks like $S_5$ permutation modeling. Evaluated within the Diffusion in RWKV-7 (DIR-7) framework on datasets such as LAION-5B and ImageNet, CrossWKV achieves a Frechet Inception Distance (FID) of 2.88 and a CLIP score of 0.33 on ImageNet 256x256, matching state-of-the-art performance while offering robust generalization across diverse prompts. The model's enhanced expressivity, combined with constant memory usage and linear scaling, positions it as a powerful solution for advanced cross-modal tasks, with potential applications in high-resolution generation and dynamic state manipulation. Code is available at https://github.com/TorchRWKV/flash-linear-attention
Chinese: CrossWKV是RWKV-7模型中的一种新型交叉注意力机制,通过单次处理整合文本和图像模态,以线性复杂度实现卓越的跨模态对齐,在文本到图像生成中达到领先性能。
English: CrossWKV is a novel cross-attention mechanism for the RWKV-7 model that enhances text-to-image generation by integrating text and image modalities in a single pass, achieving state-of-the-art performance with superior cross-modal alignment and linear complexity.

Authors:Yikun Ji, Yan Hong, Jiahui Zhan, Haoxing Chen, jun lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
Title: Towards Explainable Fake Image Detection with Multi-Modal Large Language Models
Abstract:
Progress in image generation raises significant public security concerns. We argue that fake image detection should not operate as a "black box". Instead, an ideal approach must ensure both strong generalization and transparency. Recent progress in Multi-modal Large Language Models (MLLMs) offers new opportunities for reasoning-based AI-generated image detection. In this work, we evaluate the capabilities of MLLMs in comparison to traditional detection methods and human evaluators, highlighting their strengths and limitations. Furthermore, we design six distinct prompts and propose a framework that integrates these prompts to develop a more robust, explainable, and reasoning-driven detection system. The code is available at https://github.com/Gennadiyev/mllm-defake.
中文: 该摘要提出利用多模态大语言模型进行可解释的虚假图像检测,通过设计六种专用提示词构建新型框架,相比传统方法在鲁棒性和可解释性方面实现显著提升。
English: This abstract proposes using Multi-modal Large Language Models for explainable fake image detection, developing a framework with six specialized prompts to enhance robustness and transparency over traditional methods.
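
One plausible way to integrate several detection prompts is simple score averaging; ask_mllm below is a hypothetical helper returning a fake-probability for one (image, prompt) pair, and the prompts shown are our own illustrations, not the paper's six.

```python
# Illustrative aggregation over a prompt set, in the spirit of the paper's
# multi-prompt framework: query the MLLM once per prompt, average verdicts.
PROMPTS = [
    "Check lighting and shadows for physical consistency.",
    "Inspect hands, text, and fine textures for artifacts.",
    "Assess global scene plausibility.",
    # the paper designs six prompts; three shown here for brevity
]

def detect_fake(image, ask_mllm, threshold=0.5) -> bool:
    scores = [ask_mllm(image, p) for p in PROMPTS]   # per-prompt fake-probability
    return sum(scores) / len(scores) > threshold
```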

Authors:Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, Fei Wu
Title: InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
Abstract:
Multimodal Large Language Models (MLLMs) have powered Graphical User Interface (GUI) Agents, showing promise in automating tasks on computing devices. Recent works have begun exploring reasoning in GUI tasks with encouraging results. However, many current approaches rely on manually designed reasoning templates, which may result in reasoning that is not sufficiently robust and adaptive for complex GUI environments. Meanwhile, some existing agents continue to operate as Reactive Actors, relying primarily on implicit reasoning that may lack sufficient depth for GUI tasks demanding planning and error recovery. We argue that advancing these agents requires a shift from reactive acting towards acting based on deliberate reasoning. To facilitate this transformation, we introduce InfiGUI-R1, an MLLM-based GUI agent developed through our Actor2Reasoner framework, a reasoning-centric, two-stage training approach designed to progressively evolve agents from Reactive Actors to Deliberative Reasoners. The first stage, Reasoning Injection, focuses on establishing a basic reasoner. We employ Spatial Reasoning Distillation to transfer cross-modal spatial reasoning capabilities from teacher models to MLLMs through trajectories with explicit reasoning steps, enabling models to integrate GUI visual-spatial information with logical reasoning before action generation. The second stage, Deliberation Enhancement, refines the basic reasoner into a deliberative one using Reinforcement Learning. This stage introduces two approaches: Sub-goal Guidance, which rewards models for generating accurate intermediate sub-goals, and Error Recovery Scenario Construction, which creates failure-and-recovery training scenarios from identified prone-to-error steps. Experimental results show InfiGUI-R1 achieves strong performance in GUI grounding and trajectory tasks. Resources at https://github.com/Reallm-Labs/InfiGUI-R1.
中文: 通过Actor2Reasoner框架开发的InfiGUI-R1代理,采用两阶段训练方法将GUI代理从反应型执行者转变为审慎型推理者,通过增强推理和错误恢复能力,在复杂GUI任务中实现更优性能。
English: The InfiGUI-R1 agent, developed through the Actor2Reasoner framework, transitions GUI agents from reactive actors to deliberative reasoners using a two-stage training approach that enhances reasoning and error recovery for improved performance in complex GUI tasks.

Authors:Junchi Yao, Shu Yang, Jianhua Xu, Lijie Hu, Mengdi Li, Di Wang
Title: Understanding the Repeat Curse in Large Language Models from a Feature Perspective
Abstract:
Large language models (LLMs) have made remarkable progress in various domains, yet they often suffer from repetitive text generation, a phenomenon we refer to as the "Repeat Curse". While previous studies have proposed decoding strategies to mitigate repetition, the underlying mechanism behind this issue remains insufficiently explored. In this work, we investigate the root causes of repetition in LLMs through the lens of mechanistic interpretability. Inspired by recent advances in Sparse Autoencoders (SAEs), which enable monosemantic feature extraction, we propose a novel approach, "Duplicatus Charm", to induce and analyze the Repeat Curse. Our method systematically identifies "Repetition Features", the key model activations responsible for generating repetitive outputs. First, we locate the layers most involved in repetition through logit analysis. Next, we extract and stimulate relevant features using SAE-based activation manipulation. To validate our approach, we construct a repetition dataset covering token and paragraph level repetitions and introduce an evaluation pipeline to quantify the influence of identified repetition features. Furthermore, by deactivating these features, we have effectively mitigated the Repeat Curse. The source code of our work is publicly available at: https://github.com/kaustpradalab/repeat-curse-llm
中文: 本研究通过机制可解释性探究大语言模型生成重复文本的根本原因,提出"Duplicatus Charm"方法识别并停用重复特征,有效缓解"重复诅咒"问题,同时公开了源代码。
English: This study investigates the root causes of repetitive text generation in large language models through mechanistic interpretability, proposing the "Duplicatus Charm" method to identify and deactivate repetition features, effectively mitigating the "Repeat Curse" while making the source code publicly available.
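
Deactivating an identified feature can be sketched as follows, assuming a trained sparse autoencoder object exposing an encode method and a decoder matrix W_dec; the paper's exact intervention may differ.

```python
# Remove one SAE feature's contribution from the residual stream: encode,
# read off the feature's activation, subtract its decoder direction.
import torch

def ablate_feature(hidden, sae, feature_idx):
    acts = sae.encode(hidden)                         # (..., n_features), assumed API
    contribution = acts[..., feature_idx, None] * sae.W_dec[feature_idx]
    return hidden - contribution                      # only that feature is removed
```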

Authors:Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
Title: Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Abstract:
The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23 points, with advantages that scale to models as large as 7.2B parameters. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability. To advance future research, we release scripts, data, and models at https://github.com/opendatalab/Meta-rater.
中文摘要:Meta-rater作为一种多维度数据筛选方法,通过专业性、可读性、逻辑性和洁净度四个维度评估数据质量,使13亿参数模型的训练速度提升两倍,下游任务性能提高3.23分,并能扩展至72亿参数模型。
English Summary: Meta-rater is a multi-dimensional data selection method that evaluates data quality across professionalism, readability, reasoning, and cleanliness, significantly accelerating model convergence by 2x and improving downstream task performance by 3.23 points for LLMs up to 7.2B parameters.
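
The weighting step is, at heart, a regression from quality scores to proxy-model validation loss. A toy sketch with made-up numbers (the paper's regressor and features are richer):

```python
# Fit a regression from per-mixture quality scores to observed proxy-model
# validation loss, then rate new candidates by predicted loss (lower = better).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[0.9, 0.4, 0.7, 0.8],     # professionalism, readability,
              [0.5, 0.8, 0.3, 0.6],     # reasoning, cleanliness (toy values)
              [0.7, 0.6, 0.9, 0.9]])
y = np.array([2.31, 2.48, 2.20])        # proxy-model validation losses (toy)

reg = LinearRegression().fit(X, y)
print(reg.coef_)                              # learned weighting over dimensions
print(reg.predict([[0.8, 0.7, 0.8, 0.9]]))    # score a new candidate mixture
```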

Authors:Christopher Zhang Cui, Xingdi Yuan, Ziang Xiao, Prithviraj Ammanabrolu, Marc-Alexandre Côté
Title: TALES: Text Adventure Learning Environment Suite
Abstract:
Reasoning is an essential skill to enable Large Language Models (LLMs) to interact with the world. As tasks become more complex, they demand increasingly sophisticated and diverse reasoning capabilities for sequential decision-making, requiring structured reasoning over the context history to determine the next best action. We introduce TALES, a diverse collection of synthetic and human-written text-adventure games designed to challenge and evaluate diverse reasoning capabilities. We present results over a range of LLMs, open- and closed-weights, performing a qualitative analysis on the top performing models. Despite an impressive showing on synthetic games, even the top LLM-driven agents fail to achieve 15% on games designed for human enjoyment. Code and visualization of the experiments can be found at https://microsoft.github.io/tale-suite.

Authors:Deyu Cao, Samin Aref
Title: Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining
Abstract:
The growing use of large language models has raised environmental and economic concerns about their intensity of resource usage during inference. Serving these models to each user requires substantial energy and water for cooling. Model compression techniques like quantization can shrink large language models and make them more resource efficient at the cost of potential performance degradation. Quantization methods compress model size through replacing their high-precision parameters by quantized values of lower precision. Among existing methods, the ApiQ method achieves superior accuracy preservation at minimal memory and time overhead. We investigate two ideas to extend performance in ultra-low-bit quantization beyond ApiQ's level. First, we look into combining existing quantization-aware training techniques with ApiQ's partial training. We show that this does not outperform the baseline ApiQ method with limited training data and frozen weights. This leads to two key insights: (1) The substantial representational capacity that is gained through full retraining is unlikely to be feasible through partial training. (2) This gain may depend on using a large and diverse dataset in quantization-aware training. Second, through a novel approach informed by the two insights, we propose an ultra-low-bit quantization method that builds upon ApiQ and extends its performance without the need for full retraining. This publicly available method relies on a saliency-aware regularization term that prioritizes preserving the most impactful parameters during quantization. Our experiments on LLaMA 7B and 13B benchmarks demonstrate that our method reduces the ApiQ's accuracy degradation by 10.85% and 7.54% respectively. A Python implementation of the proposed quantization method is publicly available on GitHub https://github.com/TokuyuSou/ULB-SAPR.
中文: 本文提出了一种新型超低位量化方法,通过显著性感知正则化在无需完全重新训练的情况下,将大型语言模型的资源效率提升的同时,相比ApiQ方法减少了超过7%的精度损失。
English: A novel ultra-low-bit quantization method with saliency-aware regularization is proposed to enhance resource efficiency of large language models while reducing accuracy degradation by over 7% compared to ApiQ, without requiring full retraining.
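
A generic form of such a saliency-aware regularizer is sketched below; here saliency is a diagonal proxy such as accumulated squared gradients, which may differ from the paper's exact weighting.

```python
# Penalize quantization error more heavily on parameters judged impactful,
# added to the task loss during partial retraining.
import torch

def saliency_regularizer(w_full, w_quant, saliency, lam=1e-3):
    # saliency: same shape as the weights, e.g. accumulated grad**2
    return lam * (saliency * (w_quant - w_full) ** 2).sum()

# total_loss = task_loss + saliency_regularizer(w, dequantize(w_q), sal)
```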

Authors:Muhan Gao, Jash Shah, Weiqi Wang, Daniel Khashabi
Title: Science Hierarchography: Hierarchical Organization of Science Literature
Abstract:
Scientific knowledge is growing rapidly, making it difficult to track progress and high-level conceptual links across broad disciplines. While tools like citation networks and search engines help retrieve related papers, they lack the abstraction needed to represent the density and structure of activity across subfields. We motivate SCIENCE HIERARCHOGRAPHY, the goal of organizing scientific literature into a high-quality hierarchical structure that spans multiple levels of abstraction -- from broad domains to specific studies. Such a representation can provide insights into which fields are well-explored and which are under-explored. To achieve this goal, we develop a hybrid approach that combines efficient embedding-based clustering with LLM-based prompting, striking a balance between scalability and semantic precision. Compared to LLM-heavy methods like iterative tree construction, our approach achieves superior quality-speed trade-offs. Our hierarchies capture different dimensions of research contributions, reflecting the interdisciplinary and multifaceted nature of modern science. We evaluate its utility by measuring how effectively an LLM-based agent can navigate the hierarchy to locate target papers. Results show that our method improves interpretability and offers an alternative pathway for exploring scientific literature beyond traditional search methods. Code, data and demo are available: https://github.com/JHU-CLSP/science-hierarchography
中文: 摘要提出了科学层级图谱方法,通过结合嵌入聚类与大语言模型提示的混合技术,将科学文献组织成多层次结构,以提升可解释性并为文献探索提供传统检索之外的替代途径。
English: The abstract introduces SCIENCE HIERARCHOGRAPHY, a method that organizes scientific literature into a hierarchical structure using a hybrid approach combining embedding-based clustering and LLM-based prompting to improve interpretability and exploration beyond traditional search methods.
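
The hybrid recursion can be sketched in a few lines; label_cluster stands in for the LLM prompting step, embeddings are assumed to be a NumPy array, and the parameters are illustrative.

```python
# Cluster paper embeddings cheaply, name each cluster with an LLM, recurse:
# broad domains at the top, individual studies at the leaves.
from sklearn.cluster import KMeans

def build_hierarchy(embeddings, papers, label_cluster, k=5, min_size=10):
    if len(papers) <= min_size:
        return {"papers": papers}                    # leaf: specific studies
    assign = KMeans(n_clusters=k, n_init=10).fit_predict(embeddings)
    node = {}
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        name = label_cluster([papers[i] for i in idx])   # LLM names the subfield
        node[name] = build_hierarchy(embeddings[idx], [papers[i] for i in idx],
                                     label_cluster, k, min_size)
    return node
```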

Authors:Shijie Xia, Yiwei Qin, Xuefeng Li, Yan Ma, Run-Ze Fan, Steffi Chern, Haoyang Zou, Fan Zhou, Xiangkun Hu, Jiahe Jin, Yanheng He, Yixin Ye, Yixiu Liu, Pengfei Liu
Title: Generative AI Act II: Test Time Scaling Drives Cognition Engineering
Abstract:
The first generation of Large Language Models - what might be called "Act I" of generative AI (2020-2023) - achieved remarkable success through massive parameter and data scaling, yet exhibited fundamental limitations such as knowledge latency, shallow reasoning, and constrained cognitive processes. During this era, prompt engineering emerged as our primary interface with AI, enabling dialogue-level communication through natural language. We now witness the emergence of "Act II" (2024-present), where models are transitioning from knowledge-retrieval systems (in latent space) to thought-construction engines through test-time scaling techniques. This new paradigm establishes a mind-level connection with AI through language-based thoughts. In this paper, we clarify the conceptual foundations of cognition engineering and explain why this moment is critical for its development. We systematically break down these advanced approaches through comprehensive tutorials and optimized implementations, democratizing access to cognition engineering and enabling every practitioner to participate in AI's second act. We provide a regularly updated collection of papers on test-time scaling in the GitHub Repository: https://github.com/GAIR-NLP/cognition-engineering
中文摘要:第一代大语言模型通过规模扩展取得成功但存在知识滞后等局限,而新兴的第二代模型通过测试时扩展技术转变为思维构建引擎,实现了与AI的思维层面连接。
English Summary: The first generation of large language models achieved success through scaling but faced limitations like knowledge latency, while the emerging second generation transitions to thought-construction engines through test-time scaling, enabling mind-level AI connections.

Authors:Chenghao Xiao, Hou Pong Chan, Hao Zhang, Mahani Aljunied, Lidong Bing, Noura Al Moubayed, Yu Rong
Title: Analyzing LLMs' Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
Abstract:
While understanding the knowledge boundaries of LLMs is crucial to prevent hallucination, research on the knowledge boundaries of LLMs has predominantly focused on English. In this work, we present the first study to analyze how LLMs recognize knowledge boundaries across different languages by probing their internal representations when processing known and unknown questions in multiple languages. Our empirical studies reveal three key findings: 1) LLMs' perceptions of knowledge boundaries are encoded in the middle to middle-upper layers across different languages. 2) Language differences in knowledge boundary perception follow a linear structure, which motivates our proposal of a training-free alignment method that effectively transfers knowledge boundary perception ability across languages, thereby helping reduce hallucination risk in low-resource languages; 3) Fine-tuning on bilingual question pair translation further enhances LLMs' recognition of knowledge boundaries across languages. Given the absence of standard testbeds for cross-lingual knowledge boundary analysis, we construct a multilingual evaluation suite comprising three representative types of knowledge boundary data. Our code and datasets are publicly available at https://github.com/DAMO-NLP-SG/LLM-Multilingual-Knowledge-Boundaries.
中文摘要:本研究首次揭示了大型语言模型在不同语言中识别知识边界的方式,发现其感知编码于模型中层,并提出无需训练的对齐方法可跨语言转移边界感知能力,有效降低低资源语言的幻觉风险。
English Summary: This study investigates how large language models perceive knowledge boundaries across different languages, revealing that such perceptions are encoded in specific model layers and can be transferred through a training-free alignment method to reduce hallucinations in low-resource languages.
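
Given the reported linear structure, one natural reading of the training-free alignment is a mean-difference shift between paired representations; this is our guess at the mechanism for illustration, not the paper's verified procedure.

```python
# Estimate a single shift vector from a few paired examples (the same
# questions in two languages), then apply it at the probed layer.
import torch

def mean_shift(src_hiddens, tgt_hiddens):
    # src/tgt: (N, d) hidden states for paired questions in two languages
    return (src_hiddens - tgt_hiddens).mean(dim=0)   # one vector, no training

def align(tgt_hidden, shift):
    return tgt_hidden + shift    # transfer boundary perception at inference
```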

Authors:Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry
Title: Learning to Attribute with Attention
Abstract:
Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality. We provide code for AT2 at https://github.com/MadryLab/AT2.
中文摘要:提出的AT2方法通过将注意力权重作为特征进行学习,实现了语言模型中词元影响的高效归因,其性能与耗时的消融方法相当,同时显著提升了计算效率。
English Summary: The proposed AT2 method efficiently attributes token influence in language models by learning to use attention weights as features, achieving performance comparable to costly ablation-based approaches while significantly improving computational efficiency.
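
The recipe reduces to supervised regression from per-head attention features to measured ablation effects. A toy sketch with synthetic data standing in for the collected features; the regressor choice is our assumption.

```python
# Treat each head's attention mass toward a source token as one feature;
# the target is that token's effect measured once via ablation. The fitted
# model then predicts influence for new tokens without further ablations.
import numpy as np
from sklearn.linear_model import Ridge

# X: (n_examples, n_heads) attention features; y: (n_examples,) ablation effects
X = np.random.rand(200, 144)                                  # toy stand-in
y = X @ np.random.rand(144) + 0.01 * np.random.randn(200)     # toy targets

model = Ridge(alpha=1.0).fit(X, y)
attribution = model.predict(X[:1])    # cheap attribution at inference time
```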

Authors:Paul K. Mandal, Cole Leo, Connor Hurley
Title: Controlled Territory and Conflict Tracking (CONTACT): (Geo-)Mapping Occupied Territory from Open Source Intelligence
Abstract:
Open-source intelligence provides a stream of unstructured textual data that can inform assessments of territorial control. We present CONTACT, a framework for territorial control prediction using large language models (LLMs) and minimal supervision. We evaluate two approaches: SetFit, an embedding-based few-shot classifier, and a prompt tuning method applied to BLOOMZ-560m, a multilingual generative LLM. Our model is trained on a small hand-labeled dataset of news articles covering ISIS activity in Syria and Iraq, using prompt-conditioned extraction of control-relevant signals such as military operations, casualties, and location references. We show that the BLOOMZ-based model outperforms the SetFit baseline, and that prompt-based supervision improves generalization in low-resource settings. CONTACT demonstrates that LLMs fine-tuned using few-shot methods can reduce annotation burdens and support structured inference from open-ended OSINT streams. Our code is available at https://github.com/PaulKMandal/CONTACT/.
中文摘要:CONTACT框架利用大型语言模型和最少监督从开源情报中预测领土控制情况,研究表明基于提示调优的BLOOMZ模型在低资源环境下优于小样本分类器,并能有效降低标注需求。
English Summary: The CONTACT framework utilizes large language models with minimal supervision to predict territorial control from open-source intelligence, demonstrating that prompt-tuned BLOOMZ outperforms few-shot classifiers in low-resource settings while reducing annotation needs.

Authors:Ritwik Mishra, Rajiv Ratn Shah, Ponnurangam Kumaraguru
Title: Long-context Non-factoid Question Answering in Indic Languages
Abstract:
Question Answering (QA) tasks, which involve extracting answers from a given context, are relatively straightforward for modern Large Language Models (LLMs) when the context is short. However, long contexts pose challenges due to the quadratic complexity of the self-attention mechanism. This challenge is compounded in Indic languages, which are often low-resource. This study explores context-shortening techniques, including Open Information Extraction (OIE), coreference resolution, Answer Paragraph Selection (APS), and their combinations, to improve QA performance. Compared to the baseline of unshortened (long) contexts, our experiments on four Indic languages (Hindi, Tamil, Telugu, and Urdu) demonstrate that context-shortening techniques yield an average improvement of 4% in semantic scores and 47% in token-level scores when evaluated on three popular LLMs without fine-tuning. Furthermore, with fine-tuning, we achieve an average increase of 2% in both semantic and token-level scores. Additionally, context-shortening reduces computational overhead. Explainability techniques like LIME and SHAP reveal that when the APS model confidently identifies the paragraph containing the answer, nearly all tokens within the selected text receive high relevance scores. However, the study also highlights the limitations of LLM-based QA systems in addressing non-factoid questions, particularly those requiring reasoning or debate. Moreover, verbalizing OIE-generated triples does not enhance system performance. These findings emphasize the potential of context-shortening techniques to improve the efficiency and effectiveness of LLM-based QA systems, especially for low-resource languages. The source code and resources are available at https://github.com/ritwikmishra/IndicGenQA.
中文: 本研究证明,通过语境缩短技术可显著提升印度语言问答系统的性能,不仅提高了语义和词元级评分并降低计算开销,但也揭示了大语言模型在处理非事实性推理问题时的局限性。
English: This study demonstrates that context-shortening techniques significantly enhance question answering performance for Indic languages by improving semantic and token-level scores while reducing computational costs, though limitations remain with non-factoid questions.
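
A minimal Answer Paragraph Selection step looks like the sketch below, using sentence-transformers as one convenient embedding choice; the paper's APS model may be trained differently.

```python
# Keep only the paragraph most similar to the question before handing the
# context to the QA model, shrinking the context the LLM must attend over.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def select_paragraph(question: str, context: str) -> str:
    paragraphs = [p for p in context.split("\n\n") if p.strip()]
    q = encoder.encode(question, convert_to_tensor=True)
    p = encoder.encode(paragraphs, convert_to_tensor=True)
    best = util.cos_sim(q, p).argmax().item()
    return paragraphs[best]                      # shortened context
```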

Authors:Jianing Wang, Jin Jiang, Yang Liu, Mengdi Zhang, Xunliang Cai
Title: Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning
Abstract:
In this paper, we introduce a new "process prejudge" strategy in LLM reasoning to demonstrate that bootstrapping with process prejudge allows the LLM to adaptively anticipate the errors encountered when advancing the subsequent reasoning steps, similar to people sometimes pausing to think about what mistakes may occur and how to avoid them, rather than relying solely on trial and error. Specifically, we define a prejudge node in the rationale, which represents a reasoning step where at least one following step has no path toward the correct answer. To synthesize the prejudge reasoning process, we present an automated reasoning framework with a dynamic tree-searching strategy. This framework requires only one LLM to perform answer judging, response critiquing, prejudge generation, and thought completion. Furthermore, we develop a two-phase training mechanism with supervised fine-tuning (SFT) and reinforcement learning (RL) to further enhance the reasoning capabilities of LLMs. Experimental results from competition-level complex reasoning demonstrate that our method can teach the model to prejudge before thinking and significantly enhance the reasoning ability of LLMs. Code and data are released at https://github.com/wjn1996/Prejudge-Before-Think.
中文: 本文提出了一种LLM推理中的"过程预判"策略,通过动态树搜索的自动化框架和两阶段训练机制,使模型能够在思考前预判潜在错误,显著提升了复杂推理能力。
English: This paper proposes a "process prejudge" strategy for LLMs that enables adaptive error anticipation during reasoning, implemented through an automated framework with dynamic tree-searching and a two-phase training mechanism, significantly boosting complex reasoning performance.

Authors:Chenwei Yan, Xiangling Fu, Yuxuan Xiong, Tianyi Wang, Siu Cheung Hui, Ji Wu, Xien Liu
Title: LLM Sensitivity Evaluation Framework for Clinical Diagnosis
Abstract:
Large language models (LLMs) have demonstrated impressive performance across various domains. However, for clinical diagnosis, higher expectations are required for LLM's reliability and sensitivity: thinking like physicians and remaining sensitive to key medical information that affects diagnostic reasoning, as subtle variations can lead to different diagnosis results. Yet, existing works focus mainly on investigating the sensitivity of LLMs to irrelevant context and overlook the importance of key information. In this paper, we investigate the sensitivity of LLMs, i.e. GPT-3.5, GPT-4, Gemini, Claude3 and LLaMA2-7b, to key medical information by introducing different perturbation strategies. The evaluation results highlight the limitations of current LLMs in remaining sensitive to key medical information for diagnostic decision-making. The evolution of LLMs must focus on improving their reliability, enhancing their ability to be sensitive to key information, and effectively utilizing this information. These improvements will enhance human trust in LLMs and facilitate their practical application in real-world scenarios. Our code and dataset are available at https://github.com/chenwei23333/DiagnosisQA.
中文: 当前大语言模型在临床诊断中对关键医学信息的敏感性存在不足,需提升其可靠性和关键信息感知能力,以增强人类信任并促进实际应用。
English: Current large language models exhibit limitations in maintaining sensitivity to crucial medical information for clinical diagnosis, necessitating improvements in reliability and key information awareness to enhance trust and practical application.

Authors:Saksham Rastogi, Pratyush Maini, Danish Pruthi
Title: STAMP Your Content: Proving Dataset Membership via Watermarked Rephrasings
Abstract:
Given how large parts of publicly available text are crawled to pretrain large language models (LLMs), data creators increasingly worry about the inclusion of their proprietary data for model training without attribution or licensing. Their concerns are also shared by benchmark curators whose test-sets might be compromised. In this paper, we present STAMP, a framework for detecting dataset membership, i.e., determining the inclusion of a dataset in the pretraining corpora of LLMs. Given an original piece of content, our proposal involves first generating multiple rephrases, each embedding a watermark with a unique secret key. One version is to be released publicly, while others are to be kept private. Subsequently, creators can compare model likelihoods between public and private versions using paired statistical tests to prove membership. We show that our framework can successfully detect contamination across four benchmarks which appear only once in the training data and constitute less than 0.001% of the total tokens, outperforming several contamination detection and dataset inference baselines. We verify that STAMP preserves both the semantic meaning and utility of the original data. We apply STAMP to two real-world scenarios to confirm the inclusion of paper abstracts and blog articles in the pretraining corpora.
中文摘要:STAMP框架通过生成带独特水印的多个改写版本,并比较公开与私有版本间的模型似然度,使数据创建者能够检测其内容是否被用于大型语言模型的预训练数据中。
English Summary: The STAMP framework enables data creators to detect if their content was used in training large language models by watermarking multiple rephrased versions and comparing model likelihoods between public and private copies.
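
The membership test itself is small. A sketch using a one-sided Wilcoxon signed-rank test as the paired test; the paper's exact test statistic may differ.

```python
# If the public rephrases were in the pretraining data, the model should
# assign them systematically higher likelihood than the held-back private
# rephrases of the same documents. A paired one-sided test checks this.
from scipy.stats import wilcoxon

def stamp_test(public_loglik, private_loglik, alpha=0.01):
    # one log-likelihood per document, paired public/private versions
    stat, p = wilcoxon(public_loglik, private_loglik, alternative="greater")
    return p < alpha          # True => evidence of dataset membership
```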

Authors:Xiangbo Gao, Yuheng Wu, Rujia Wang, Chenxi Liu, Yang Zhou, Zhengzhong Tu
Title: LangCoop: Collaborative Driving with Language
Abstract:
Multi-agent collaboration holds great promise for enhancing the safety, reliability, and mobility of autonomous driving systems by enabling information sharing among multiple connected agents. However, existing multi-agent communication approaches are hindered by limitations of existing communication media, including high bandwidth demands, agent heterogeneity, and information loss. To address these challenges, we introduce LangCoop, a new paradigm for collaborative autonomous driving that leverages natural language as a compact yet expressive medium for inter-agent communication. LangCoop features two key innovations: Mixture Model Modular Chain-of-thought (M$^3$CoT) for structured zero-shot vision-language reasoning and Natural Language Information Packaging (LangPack) for efficiently packaging information into concise, language-based messages. Through extensive experiments conducted in the CARLA simulations, we demonstrate that LangCoop achieves a remarkable 96% reduction in communication bandwidth (< 2KB per message) compared to image-based communication, while maintaining competitive driving performance in the closed-loop evaluation. Our project page and code are at https://xiangbogaobarry.github.io/LangCoop/.

Authors:Mehmet Hamza Erol, Batu El, Mirac Suzgun, Mert Yuksekgonul, James Zou
Title: Cost-of-Pass: An Economic Framework for Evaluating Language Models
Abstract:
The widespread adoption of AI systems in the economy hinges on their ability to generate economic value that outweighs their inference costs. Evaluating this tradeoff requires metrics that account for both performance and costs. We propose a framework grounded in production theory for evaluating language models by combining accuracy and inference cost. We introduce "cost-of-pass", the expected monetary cost of generating a correct solution. We then define the "frontier cost-of-pass" as the minimum cost-of-pass achievable across available models or the "human-expert" baseline, using the approximate cost of hiring an expert. Our analysis reveals distinct economic insights. First, lightweight models are most cost-effective for basic quantitative tasks, large models for knowledge-intensive ones, and reasoning models for complex quantitative problems, despite higher per-token costs. Second, tracking this frontier cost-of-pass over the past year reveals significant progress, particularly for complex quantitative tasks where the cost has roughly halved every few months. Third, to trace key innovations driving this progress, we examine counterfactual frontiers: estimates of cost-efficiency without specific model classes. We find that innovations in lightweight, large, and reasoning models have been essential for pushing the frontier in basic quantitative, knowledge-intensive, and complex quantitative tasks, respectively. Finally, we assess the cost-reductions afforded by common inference-time techniques like majority voting and self-refinement, finding that their marginal accuracy gains rarely justify their costs. Our findings underscore that complementary model-level innovations are the primary drivers of cost-efficiency, and our economic framework provides a principled tool for measuring this progress and guiding deployment.
中文摘要:本研究提出基于“通过成本”的经济评估框架,发现轻量级、大型和推理模型分别在基础定量、知识密集和复杂定量任务中具有最优成本效益,且模型层面的创新是推动成本效率提升的关键因素。
English Summary: The study introduces an economic framework using "cost-of-pass" metrics to evaluate AI systems, revealing that lightweight, large, and reasoning models each excel in specific tasks, with model-level innovations being the main driver of cost-efficiency improvements.
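
The metric itself fits in one line: expected cost per correct solution. A sketch with toy per-attempt costs and pass rates (the numbers are illustrative, not from the paper):

```python
# Cost-of-pass: with per-attempt cost c and success probability p, the
# expected spend per correct solution is c / p. The frontier takes the
# minimum over available models (or a human-expert cost baseline).
def cost_of_pass(cost_per_attempt: float, pass_rate: float) -> float:
    return float("inf") if pass_rate == 0 else cost_per_attempt / pass_rate

models = {"light": (0.002, 0.40), "large": (0.02, 0.85), "reasoner": (0.12, 0.95)}
frontier = min(cost_of_pass(c, p) for c, p in models.values())
```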

Authors:Weijie Shi, Jipeng Zhang, Yaguang Wu, Jingzhi Fang, Ruiyuan Zhang, Jiajie Xu, Jia Zhu, Hao Chen, Yao Zhao, Sirui Han, Xiaofang Zhou
Title: DIDS: Domain Impact-aware Data Sampling for Large Language Model Training
Abstract:
Large language models (LLMs) are commonly trained on multi-domain datasets, where domain sampling strategies significantly impact model performance due to varying domain importance across downstream tasks. Existing approaches for optimizing domain-level sampling strategies struggle with maintaining intra-domain consistency and accurately measuring domain impact. In this paper, we present Domain Impact-aware Data Sampling (DIDS). To ensure intra-domain consistency, a gradient clustering algorithm is proposed to group training data based on their learning effects, where a proxy language model and dimensionality reduction are employed to reduce computational overhead. To accurately measure domain impact, we develop a Fisher Information Matrix (FIM) guided metric that quantifies how domain-specific parameter updates affect the model's output distributions on downstream tasks, with theoretical guarantees. Furthermore, to determine optimal sampling ratios, DIDS combines both the FIM-guided domain impact assessment and loss learning trajectories that indicate domain-specific potential, while accounting for diminishing marginal returns. Extensive experiments demonstrate that DIDS achieves 3.4% higher average performance while maintaining comparable training efficiency. The code is available at https://github.com/shiweijiezero/DIDS.
Chinese: 本文提出的领域影响感知数据采样(DIDS)方法通过梯度聚类确保领域内一致性,并利用费舍尔信息矩阵度量领域影响,在实验中实现了3.4%的性能提升,同时保持训练效率。
English: The paper introduces Domain Impact-aware Data Sampling (DIDS), which optimizes domain sampling for large language models by using gradient clustering for intra-domain consistency and a Fisher Information Matrix metric to measure domain impact, achieving a 3.4% performance boost in experiments.

Authors:Kevin Lin, Charlie Snell, Yu Wang, Charles Packer, Sarah Wooders, Ion Stoica, Joseph E. Gonzalez
Title: Sleep-time Compute: Beyond Inference Scaling at Test-time
Abstract:
Scaling test-time compute has emerged as a key ingredient for enabling large language models (LLMs) to solve difficult problems, but comes with high latency and inference cost. We introduce sleep-time compute, which allows models to "think" offline about contexts before queries are presented: by anticipating what queries users might ask and pre-computing useful quantities, we can significantly reduce the compute requirements at test-time. To demonstrate the efficacy of our method, we create modified versions of two reasoning tasks - Stateful GSM-Symbolic and Stateful AIME. We find that sleep-time compute can reduce the amount of test-time compute needed to achieve the same accuracy by ~ 5x on Stateful GSM-Symbolic and Stateful AIME and that by scaling sleep-time compute we can further increase accuracy by up to 13% on Stateful GSM-Symbolic and 18% on Stateful AIME. Furthermore, we introduce Multi-Query GSM-Symbolic, which extends GSM-Symbolic by including multiple related queries per context. By amortizing sleep-time compute across related queries about the same context using Multi-Query GSM-Symbolic, we can decrease the average cost per query by 2.5x. We then conduct additional analysis to understand when sleep-time compute is most effective, finding the predictability of the user query to be well correlated with the efficacy of sleep-time compute. Finally, we conduct a case-study of applying sleep-time compute to a realistic agentic SWE task.
中文: 睡眠时间计算让大语言模型能够离线预计算响应,将测试时计算需求降低最多5倍,准确率最高提升18%,并通过分摊相关查询成本使单次查询开销显著减少。
English: Sleep-time compute enables large language models to pre-compute responses offline, reducing test-time compute by up to 5x and improving accuracy by up to 18% while cutting per-query costs through amortization across related queries.
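
The pattern is easy to sketch; llm is a hypothetical completion function, and CACHE stands in for whatever store holds the offline work.

```python
# Spend tokens on the context before any query arrives, then answer later
# queries against the cached digest with little test-time compute. Shared
# contexts amortize the offline cost across related queries.
CACHE = {}

def sleep_time(context_id: str, context: str, llm):
    # offline: anticipate questions and derive useful intermediate results
    CACHE[context_id] = llm("Summarize, derive key quantities, and list "
                            "likely questions with answers:\n" + context)

def answer(context_id: str, query: str, llm):
    # online: a short prompt over precomputed notes instead of raw context
    return llm("Notes:\n" + CACHE[context_id] + "\n\nAnswer: " + query)
```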

Authors:Yongqian Peng, Yuxi Ma, Mengmeng Wang, Yuxuan Wang, Yizhou Wang, Chi Zhang, Yixin Zhu, Zilong Zheng
Title: Probing and Inducing Combinational Creativity in Vision-Language Models
Abstract:
The ability to combine existing concepts into novel ideas stands as a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) like GPT-4V and DALLE-3 have sparked debate about whether their outputs reflect combinational creativity--defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts--or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs from the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, best VLMs have surpassed average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLMs' outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.

Authors:Han Wang, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Title: Retrieval-Augmented Generation with Conflicting Evidence
Abstract:
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses. However, in practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources while also suppressing inaccurate information from noisy or irrelevant documents. Prior work has generally studied and addressed these challenges in isolation, considering only one aspect at a time, such as handling ambiguity or robustness to noise and misinformation. We instead consider multiple factors simultaneously, proposing (i) RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query, including ambiguity, misinformation, and noise; and (ii) MADAM-RAG, a multi-agent approach in which LLM agents debate over the merits of an answer over multiple rounds, allowing an aggregator to collate responses corresponding to disambiguated entities while discarding misinformation and noise, thereby handling diverse sources of conflict jointly. We demonstrate the effectiveness of MADAM-RAG using both closed and open-source models on AmbigDocs -- which requires presenting all valid answers for ambiguous queries -- improving over strong RAG baselines by up to 11.40% and on FaithEval -- which requires suppressing misinformation -- where we improve by up to 15.80% (absolute) with Llama3.3-70B-Instruct. Furthermore, we find that RAMDocs poses a challenge for existing RAG baselines (Llama3.3-70B-Instruct only obtains 32.60 exact match score). While MADAM-RAG begins to address these conflicting factors, our analysis indicates that a substantial gap remains especially when increasing the level of imbalance in supporting evidence and misinformation.
中文: 研究者提出RAMDocs数据集模拟包含歧义和错误信息的复杂检索场景,并开发MADAM-RAG多智能体辩论系统,能同时处理多种信息冲突,在基准测试中最高提升15.80%的准确率。
English: Researchers introduce RAMDocs, a dataset simulating complex retrieval scenarios with ambiguity and misinformation, and MADAM-RAG, a multi-agent debating system that jointly handles conflicting information while improving accuracy over baselines by up to 15.80%.
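The debate mechanism is easy to prototype. Below is a minimal sketch of a MADAM-RAG-style loop, assuming a generic `llm(prompt) -> str` callable; the prompt wording, round count, and aggregation instruction are illustrative stand-ins, not the authors' released implementation.
```python
def debate(llm, question, documents, rounds=3):
    # One agent per retrieved document answers independently first.
    answers = [llm(f"Using only this document, answer: {question}\n{doc}")
               for doc in documents]
    # In later rounds, each agent sees the others' answers and may revise.
    for _ in range(rounds - 1):
        answers = [
            llm(f"Using only this document, answer: {question}\n{doc}\n"
                f"Other agents answered: {answers[:i] + answers[i+1:]}\n"
                "Revise or defend your answer.")
            for i, doc in enumerate(documents)
        ]
    # The aggregator collates answers per disambiguated entity and
    # discards ones that look like misinformation or noise.
    return llm(f"Question: {question}\nAgent answers: {answers}\n"
               "Collate the answers per disambiguated entity and discard "
               "unsupported ones.")
```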

Authors:Ebrahim Norouzi, Sven Hertling, Harald Sack
Title: ConExion: Concept Extraction with Large Language Models
Abstract:
In this paper, an approach for concept extraction from documents using pre-trained large language models (LLMs) is presented. Compared with conventional methods that extract keyphrases summarizing the important information discussed in a document, our approach tackles a more challenging task of extracting all present concepts related to the specific domain, not just the important ones. Through comprehensive evaluations of two widely used benchmark datasets, we demonstrate that our method improves the F1 score compared to state-of-the-art techniques. Additionally, we explore the potential of using prompts within these models for unsupervised concept extraction. The extracted concepts are intended to support domain coverage evaluation of ontologies and facilitate ontology learning, highlighting the effectiveness of LLMs in concept extraction tasks. Our source code and datasets are publicly available at https://github.com/ISE-FIZKarlsruhe/concept_extraction.
中文: 本文提出了一种利用预训练大语言模型从文档中提取所有领域相关概念的方法,相比现有技术提升了F1值,并通过提示探索无监督提取以支持本体评估和学习。
English: This paper introduces a method using pre-trained large language models to extract all domain-related concepts from documents, showing improved F1 scores over existing techniques and exploring unsupervised extraction via prompts to aid ontology evaluation and learning.

Authors:Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen
Title: EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting
Abstract:
Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, together with a phoneme boost variant that outputs phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. In addition, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Dataset, code, checkpoints, and demo samples are available at https://github.com/yanghaha0908/EmoVoice.
中文:EmoVoice是一种新颖的情感可控语音合成模型,利用大语言模型实现细粒度的自然语言情感控制,并通过音素增强设计提高内容一致性,在英文和中文测试集上均取得了最先进的性能表现。
English: EmoVoice is an emotion-controllable TTS model that utilizes large language models for fine-grained natural language emotion control and a phoneme boost design to enhance content consistency, achieving state-of-the-art performance on both English and Chinese test sets.

Authors:Xue Wen Tan, Stanley Kok
Title: SMARTe: Slot-based Method for Accountable Relational Triple extraction
Abstract:
Relational Triple Extraction (RTE) is a fundamental task in Natural Language Processing (NLP). However, prior research has primarily focused on optimizing model performance, with limited efforts to understand the internal mechanisms driving these models. Many existing methods rely on complex preprocessing to induce specific interactions, often resulting in opaque systems that may not fully align with their theoretical foundations. To address these limitations, we propose SMARTe: a Slot-based Method for Accountable Relational Triple extraction. SMARTe introduces intrinsic interpretability through a slot attention mechanism and frames the task as a set prediction problem. Slot attention consolidates relevant information into distinct slots, ensuring all predictions can be explicitly traced to learned slot representations and the tokens contributing to each predicted relational triple. While emphasizing interpretability, SMARTe achieves performance comparable to state-of-the-art models. Evaluations on the NYT and WebNLG datasets demonstrate that adding interpretability does not compromise performance. Furthermore, we conducted qualitative assessments to showcase the explanations provided by SMARTe, using attention heatmaps that map to their respective tokens. We conclude with a discussion of our findings and propose directions for future research. Our code is available at https://github.com/Chen-XueWen/SMARTe.
中文摘要:SMARTe是一种基于槽注意力的可解释关系三元组抽取方法,通过将信息整合至可追溯的槽表示中,在保持与先进模型相当性能的同时实现了内在可解释性。
English Summary: SMARTe is an interpretable relational triple extraction method that uses slot attention to consolidate information into traceable representations while maintaining performance comparable to state-of-the-art models.
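Since slot attention is the interpretability mechanism here, a compact PyTorch sketch may help. This is the generic slot-attention update (queries come from slots, which compete for tokens via a softmax over the slot axis), not the SMARTe codebase; the GRU update and iteration count are standard choices I am assuming.
```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal slot-attention sketch; the generic algorithm, not SMARTe's release."""
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim))
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.iters = iters
        self.scale = dim ** -0.5

    def forward(self, tokens):                        # tokens: (seq_len, dim)
        slots = self.slots
        k, v = self.k(tokens), self.v(tokens)
        for _ in range(self.iters):
            logits = self.q(slots) @ k.t() * self.scale   # (slots, seq_len)
            attn = logits.softmax(dim=0)              # slots compete per token
            attn = attn / attn.sum(dim=1, keepdim=True)
            slots = self.gru(attn @ v, slots)         # recurrent slot update
        return slots, attn   # each attn row is a token heatmap for one slot

slots, attn = SlotAttention(num_slots=4, dim=64)(torch.randn(10, 64))
```
The returned attention rows are exactly what the paper's heatmaps visualize: which tokens contributed to each slot, and hence to each predicted triple.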

Authors:Yicheng Pan, Zhenrong Zhang, Pengfei Hu, Jiefeng Ma, Jun Du, Jianshu Zhang, Quan Liu, Jianqing Gao, Feng Ma
Title: Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration
Abstract:
Recent advances in Multimodal Large Language Models (MLLMs) have achieved remarkable progress in general domains and demonstrated promise in multimodal mathematical reasoning. However, applying MLLMs to geometry problem solving (GPS) remains challenging due to the lack of accurate step-by-step solution data and severe hallucinations during reasoning. In this paper, we propose GeoGen, a pipeline that automatically generates step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. To further enhance the logical reasoning ability of MLLMs, we train GeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen. Serving as a bridge between natural language and symbolic systems, GeoLogic enables symbolic tools to help verify MLLM outputs, making the reasoning process more rigorous and alleviating hallucinations. Experimental results show that our approach consistently improves the performance of MLLMs, achieving remarkable results on benchmarks for geometric reasoning tasks. This improvement stems from our integration of the strengths of LLMs and symbolic systems, which enables a more reliable and interpretable approach for the GPS task. Codes are available at https://github.com/ycpNotFound/GeoGen.
中文: 本文提出GeoGen自动生成几何问题的逐步推理数据,并训练GeoLogic模型将符号系统融入多模态大语言模型,以增强几何推理能力、减少幻觉现象并显著提升任务表现。
English: This paper introduces GeoGen, a pipeline that generates step-by-step reasoning data for geometry problems, and GeoLogic, a model trained on this data to enhance multimodal large language models' geometric reasoning by integrating symbolic systems, reducing hallucinations and improving performance.

Authors:Haidar Khan, Hisham A. Alyahya, Yazeed Alnumay, M Saiful Bari, Bülent Yener
Title: ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition
Abstract:
Evaluating the capabilities of Large Language Models (LLMs) has traditionally relied on static benchmark datasets, human assessments, or model-based evaluations - methods that often suffer from overfitting, high costs, and biases. ZeroSumEval is a novel competition-based evaluation protocol that leverages zero-sum games to assess LLMs with dynamic benchmarks that resist saturation. ZeroSumEval encompasses a diverse suite of games, including security challenges (PyJail), classic games (Chess, Liar's Dice, Poker), knowledge tests (MathQuiz), and persuasion challenges (Gandalf, Debate). These games are designed to evaluate a range of AI capabilities such as strategic reasoning, planning, knowledge application, and creativity. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework. To demonstrate this, we conduct extensive experiments with >7000 simulations across 7 games and 13 models. Our results show that while frontier models from the GPT and Claude families can play common games and answer questions, they struggle to play games that require creating novel and challenging questions. We also observe that models cannot reliably jailbreak each other and fail generally at tasks requiring creativity. We release our code at https://github.com/facebookresearch/ZeroSumEval.
Chinese: ZeroSumEval提出了一种基于零和博弈的竞争性评估协议,通过动态基准测试大语言模型,发现尽管它们在常规任务中表现良好,但在创造性和新颖问题解决方面存在明显不足。
English: ZeroSumEval introduces a competition-based evaluation protocol using zero-sum games to dynamically assess Large Language Models, revealing their limitations in creativity and novel problem-solving despite proficiency in common tasks.
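At its core the protocol is a large match scheduler: every model pair meets in every game, many times. A toy version of the simulation loop, with the actual game harness stubbed out as a `play` callable returning the winner (my placeholder, not the framework's API):
```python
import itertools

def run_simulations(models, games, play, sims_per_pair=10):
    # Schedule head-to-head matches across all model pairs and games;
    # `play` runs one match and returns the winning model (or None for a draw).
    results = []
    for game in games:
        for a, b in itertools.combinations(models, 2):
            for _ in range(sims_per_pair):
                results.append((game, a, b, play(game, a, b)))
    return results
```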

Authors:Negar Arabzadeh, Charles L. A. Clarke
Title: Benchmarking LLM-based Relevance Judgment Methods
Abstract:
Large Language Models (LLMs) are increasingly deployed in both academic and industry settings to automate the evaluation of information seeking systems, particularly by generating graded relevance judgments. Previous work on LLM-based relevance assessment has primarily focused on replicating graded human relevance judgments through various prompting strategies. However, there has been limited exploration of alternative assessment methods or comprehensive comparative studies. In this paper, we systematically compare multiple LLM-based relevance assessment methods, including binary relevance judgments, graded relevance assessments, pairwise preference-based methods, and two nugget-based evaluation methods -- document-agnostic and document-dependent. In addition to a traditional comparison based on system rankings using Kendall correlations, we also examine how well LLM judgments align with human preferences, as inferred from relevance grades. We conduct extensive experiments on datasets from three TREC Deep Learning tracks (2019, 2020, and 2021) as well as the ANTIQUE dataset, which focuses on non-factoid open-domain question answering. As part of our data release, we include relevance judgments generated by both an open-source (Llama3.2b) and a commercial (gpt-4o) model. Our goal is to reproduce various LLM-based relevance judgment methods to provide a comprehensive comparison. All code, data, and resources are publicly available in our GitHub Repository at https://github.com/Narabzad/llm-relevance-judgement-comparison.
中文: 本文系统比较了多种基于大语言模型的相关性评估方法,包括二元判断和成对偏好等,通过多个数据集验证其与人工评估的一致性,并提供全面的对比分析。
English: This paper systematically compares multiple LLM-based relevance assessment methods, including binary judgments and pairwise preferences, across multiple datasets to evaluate their alignment with human judgments and provide comprehensive comparative insights.
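The ranking-based part of this comparison reduces to Kendall's tau between the system orderings induced by human and LLM judgments. A minimal sketch with SciPy; the effectiveness scores are made up for illustration.
```python
from scipy.stats import kendalltau

# Toy per-system effectiveness (e.g., nDCG) under each judgment source.
human_scores = {"sysA": 0.71, "sysB": 0.64, "sysC": 0.58}
llm_scores   = {"sysA": 0.69, "sysB": 0.66, "sysC": 0.55}

systems = sorted(human_scores)
tau, p = kendalltau([human_scores[s] for s in systems],
                    [llm_scores[s] for s in systems])
print(f"Kendall tau = {tau:.3f} (p = {p:.3f})")
```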

Authors:Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, Amelia Glaese
Title: BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents
Abstract:
We present BrowseComp, a simple yet challenging benchmark for measuring the ability of agents to browse the web. BrowseComp comprises 1,266 questions that require persistently navigating the internet in search of hard-to-find, entangled information. Despite the difficulty of the questions, BrowseComp is simple and easy to use, as predicted answers are short and easily verifiable against reference answers. BrowseComp for browsing agents can be seen as analogous to how programming competitions are an incomplete but useful benchmark for coding agents. While BrowseComp sidesteps challenges of a true user query distribution, like generating long answers or resolving ambiguity, it measures the important core capability of exercising persistence and creativity in finding information. BrowseComp can be found at https://github.com/openai/simple-evals.
中文: BrowseComp是一个简洁而具有挑战性的基准测试,通过1,266个可验证的简答题来评估网络浏览代理持续查找复杂关联信息的能力。
English: BrowseComp is a straightforward yet demanding benchmark designed to assess web browsing agents' ability to persistently locate complex, interconnected information through 1,266 verifiable short-answer questions.

Authors:Negar Arabzadeh, Charles L. A. Clarke
Title: A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment
Abstract:
Large Language Models (LLMs) are increasingly used to automate relevance judgments for information retrieval (IR) tasks, often demonstrating agreement with human labels that approaches inter-human agreement. To assess the robustness and reliability of LLM-based relevance judgments, we systematically investigate the impact of prompt sensitivity on the task. We collected prompts for relevance assessment from 15 human experts and 15 LLMs across three tasks -- binary, graded, and pairwise -- yielding 90 prompts in total. After filtering out unusable prompts from three humans and three LLMs, we employed the remaining 72 prompts with three different LLMs as judges to label document/query pairs from two TREC Deep Learning Datasets (2020 and 2021). We compare LLM-generated labels with TREC official human labels using Cohen's kappa and pairwise agreement measures. In addition to investigating the impact of prompt variations on agreement with human labels, we compare human- and LLM-generated prompts and analyze differences among different LLMs as judges. We also compare human- and LLM-generated prompts with the standard UMBRELA prompt used for relevance assessment by Bing and the TREC 2024 Retrieval Augmented Generation (RAG) Track. To support future research in LLM-based evaluation, we release all data and prompts at https://github.com/Narabzad/prompt-sensitivity-relevance-judgements/.
中文摘要:大型语言模型在信息检索中越来越多地用于自动化相关性判断,其与人类标注的一致性接近人类间一致性,本研究系统评估了提示敏感性在不同任务和数据集上对模型鲁棒性和可靠性的影响。
English Summary: Large Language Models (LLMs) are increasingly used for automated relevance judgments in information retrieval, showing agreement with human labels that nears inter-human agreement, while this study systematically evaluates the impact of prompt sensitivity on their robustness and reliability across various tasks and datasets.
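Prompt sensitivity can be quantified as the spread of Cohen's kappa across prompt variants. A toy sketch with scikit-learn; the labels below are invented for illustration, not the released data.
```python
import statistics
from sklearn.metrics import cohen_kappa_score

human = [1, 0, 1, 1, 0, 1, 0, 0]              # TREC-style binary labels (toy)
judgments_by_prompt = {                        # one LLM label list per prompt
    "prompt_a": [1, 0, 1, 1, 0, 1, 0, 1],
    "prompt_b": [1, 0, 0, 1, 0, 1, 0, 0],
    "prompt_c": [1, 1, 1, 1, 0, 1, 0, 0],
}
kappas = [cohen_kappa_score(human, j) for j in judgments_by_prompt.values()]
print(f"kappa mean = {statistics.mean(kappas):.3f}, "
      f"spread = {max(kappas) - min(kappas):.3f}")
```
A large spread across otherwise reasonable prompts is precisely the sensitivity the paper is measuring.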

Authors:Nay Myat Min, Long H. Pham, Yige Li, Jun Sun
Title: Propaganda via AI? A Study on Semantic Backdoors in Large Language Models
Abstract:
Large language models (LLMs) demonstrate remarkable performance across myriad language tasks, yet they remain vulnerable to backdoor attacks, where adversaries implant hidden triggers that systematically manipulate model outputs. Traditional defenses focus on explicit token-level anomalies and therefore overlook semantic backdoors -- covert triggers embedded at the conceptual level (e.g., ideological stances or cultural references) that rely on meaning-based cues rather than lexical oddities. We first show, in a controlled finetuning setting, that such semantic backdoors can be implanted with only a small poisoned corpus, establishing their practical feasibility. We then formalize the notion of semantic backdoors in LLMs and introduce a black-box detection framework, RAVEN (short for "Response Anomaly Vigilance for uncovering semantic backdoors"), which combines semantic entropy with cross-model consistency analysis. The framework probes multiple models with structured topic-perspective prompts, clusters the sampled responses via bidirectional entailment, and flags anomalously uniform outputs; cross-model comparison isolates model-specific anomalies from corpus-wide biases. Empirical evaluations across diverse LLM families (GPT-4o, Llama, DeepSeek, Mistral) uncover previously undetected semantic backdoors, providing the first proof-of-concept evidence of these hidden vulnerabilities and underscoring the urgent need for concept-level auditing of deployed language models. We open-source our code and data at https://github.com/NayMyatMin/RAVEN.
中文摘要:大型语言模型易受基于概念触发器的语义后门攻击,为此研发的RAVEN检测框架通过跨模型一致性分析,成功在多种模型中发现了这类隐蔽漏洞。
English Summary: Large language models are susceptible to semantic backdoor attacks using conceptual triggers, prompting the development of RAVEN, a detection framework that successfully identifies these hidden vulnerabilities across multiple models.
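The detection signal combines semantic entropy with cross-model comparison. Here is a minimal sketch of the entropy half: cluster sampled responses by bidirectional entailment, then take the Shannon entropy over cluster sizes. Exact string match stands in for the NLI entailment model, which is a deliberate simplification.
```python
import math

def semantic_entropy(responses, entails):
    # Greedy clustering: a response joins a cluster iff it bidirectionally
    # entails that cluster's representative.
    clusters = []
    for r in responses:
        for c in clusters:
            if entails(r, c[0]) and entails(c[0], r):
                c.append(r)
                break
        else:
            clusters.append([r])
    probs = [len(c) / len(responses) for c in clusters]
    return -sum(p * math.log(p) for p in probs)

# Anomalously LOW entropy on a probed topic flags suspicious uniformity.
print(semantic_entropy(["yes", "yes", "yes", "no"], lambda a, b: a == b))
```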

Authors:Xiangju Li, Dong Yang, Xiaogang Zhu, Faliang Huang, Peng Zhang, Zhongying Zhao
Title: Span-level Emotion-Cause-Category Triplet Extraction with Instruction Tuning LLMs and Data Augmentation
Abstract:
Span-level emotion-cause-category triplet extraction represents a novel and complex challenge within emotion cause analysis. This task involves identifying emotion spans, cause spans, and their associated emotion categories within the text to form structured triplets. While prior research has predominantly concentrated on clause-level emotion-cause pair extraction and span-level emotion-cause detection, these methods often confront challenges originating from redundant information retrieval and difficulty in accurately determining emotion categories, particularly when emotions are expressed implicitly or ambiguously. To overcome these challenges, this study explores a fine-grained approach to span-level emotion-cause-category triplet extraction and introduces an innovative framework that leverages instruction tuning and data augmentation techniques based on large language models. The proposed method employs task-specific triplet extraction instructions and utilizes low-rank adaptation to fine-tune large language models, eliminating the necessity for intricate task-specific architectures. Furthermore, a prompt-based data augmentation strategy is developed to address data scarcity by guiding large language models in generating high-quality synthetic training data. Extensive experimental evaluations demonstrate that the proposed approach significantly outperforms existing baseline methods, achieving at least a 12.8% improvement in span-level emotion-cause-category triplet extraction metrics. The results demonstrate the method's effectiveness and robustness, offering a promising avenue for advancing research in emotion cause analysis. The source code is available at https://github.com/zxgnlp/InstruDa-LLM.
中文: 本研究提出了一种基于指令调优大语言模型和数据增强的创新框架,显著提升了跨度级情感-原因-类别三元组提取性能,比现有方法至少提高了12.8%的指标。
English: This study introduces a novel framework using instruction-tuned large language models and data augmentation to significantly improve span-level emotion-cause-category triplet extraction, achieving over 12.8% better performance than existing methods.
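The instruction-tuning side of such a pipeline is largely a low-rank adaptation setup. A sketch using the Hugging Face `peft` library; the base model name and hyperparameters below are illustrative assumptions, not the paper's configuration.
```python
# Sketch: LoRA fine-tuning setup for instruction-tuned triplet extraction.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()   # only the low-rank adapters train
```
The point of low-rank adaptation here is exactly what the abstract claims: no task-specific architecture, just adapters on a general LLM driven by triplet-extraction instructions.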

Authors:Pei Liu, Xin Liu, Ruoyu Yao, Junming Liu, Siyuan Meng, Ding Wang, Jun Ma
Title: HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation
Abstract:
While Retrieval-Augmented Generation (RAG) augments Large Language Models (LLMs) with external knowledge, conventional single-agent RAG remains fundamentally limited in resolving complex queries demanding coordinated reasoning across heterogeneous data ecosystems. We present HM-RAG, a novel Hierarchical Multi-agent Multimodal RAG framework that pioneers collaborative intelligence for dynamic knowledge synthesis across structured, unstructured, and graph-based data. The framework is composed of a three-tiered architecture with specialized agents: a Decomposition Agent that dissects complex queries into contextually coherent sub-tasks via semantic-aware query rewriting and schema-guided context augmentation; Multi-source Retrieval Agents that carry out parallel, modality-specific retrieval using plug-and-play modules designed for vector, graph, and web-based databases; and a Decision Agent that uses consistency voting to integrate multi-source answers and resolve discrepancies in retrieval results through Expert Model Refinement. This architecture attains comprehensive query understanding by combining textual, graph-relational, and web-derived evidence, resulting in a remarkable 12.95% improvement in answer accuracy and a 3.56% boost in question classification accuracy over baseline RAG systems on the ScienceQA and CrisisMMD benchmarks. Notably, HM-RAG establishes state-of-the-art results in zero-shot settings on both datasets. Its modular architecture ensures seamless integration of new data modalities while maintaining strict data governance, marking a significant advancement in addressing the critical challenges of multimodal reasoning and knowledge synthesis in RAG systems. Code is available at https://github.com/ocean-luna/HMRAG.
Chinese: HM-RAG提出了一种分层多智能体框架,通过分解复杂查询并整合多源异构数据来增强多模态推理能力,相比传统RAG系统在多个基准测试中实现了显著准确率提升。
English: HM-RAG introduces a hierarchical multi-agent framework that enhances multimodal reasoning by decomposing complex queries and integrating diverse data sources, achieving significant accuracy improvements over conventional RAG systems.
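The Decision Agent's consistency voting can be sketched in a few lines. The 50% threshold and the refinement fallback signature are my assumptions, not the released code.
```python
from collections import Counter

def decision_agent(candidate_answers, refine):
    # Consistency voting over per-modality answers; ties or low agreement
    # fall back to an Expert Model Refinement call (stubbed as `refine`).
    votes = Counter(candidate_answers)
    answer, count = votes.most_common(1)[0]
    if count / len(candidate_answers) >= 0.5:      # majority agreement
        return answer
    return refine(candidate_answers)

print(decision_agent(["Paris", "Paris", "Lyon"], refine=lambda xs: xs[0]))
```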

Authors:Jialun Zhong, Wei Shen, Yanzeng Li, Songyang Gao, Hua Lu, Yicheng Chen, Yang Zhang, Wei Zhou, Jinjie Gu, Lei Zou
Title: A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Abstract:
Reward Model (RM) has demonstrated impressive potential for enhancing Large Language Models (LLM), as RM can serve as a proxy for human preferences, providing signals to guide LLMs' behavior in various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss the benchmarks for evaluation. Furthermore, we conduct an in-depth analysis of the challenges existing in the field and dive into the potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available at https://github.com/JLZhong23/awesome-reward-models.
Chinese: 本文全面综述了奖励模型的研究进展、应用及挑战,旨在为初学者提供系统指导并推动该领域的未来发展。
English: This paper offers a comprehensive overview of reward models, detailing their development, applications, and challenges to serve as a foundational guide for beginners and future research.

Authors:Xin Gao, Qizhi Pei, Zinan Tang, Yu Li, Honglin Lin, Jiang Wu, Lijun Wu, Conghui He
Title: A Strategic Coordination Framework of Small LLMs Matches Large LLMs in Data Synthesis
Abstract:
While data synthesis and distillation are promising strategies to enhance small language models, current approaches heavily rely on Large Language Models (LLMs), which suffer from high computational costs, environmental inefficiency, and potential biases inherited from monolithic architectures. In contrast, smaller LLMs are more accessible and sustainable, but their individual capabilities often fall short in generating high-quality, diverse, and reliable data. Inspired by collaborative human processes (e.g., peer review), we propose GRA, a framework that coordinates multiple small LLMs in specialized roles to achieve the iterative refinement and quality control typically provided by a single large LLM. In this collaborative framework, multiple small LLMs assume distinct roles -- Generator, Reviewer, and Adjudicator -- to simulate a peer-review-inspired data synthesis pipeline. The Generator proposes initial data samples, the Reviewer critiques their quality and diversity, and the Adjudicator resolves conflicts to finalize the output. By decomposing the synthesis process into specialized sub-tasks, collaborative small LLMs can achieve data-level parity with large LLM-based distillation. Through experiments across multiple benchmarks, we demonstrate that GRA-produced data matches or exceeds the quality of single large LLM outputs, e.g., Qwen-2.5-72B-Instruct. Our results challenge the necessity of monolithic large models for high-quality data synthesis, advocating instead for strategic coordination of smaller agents. Our datasets, models, and code are publicly available at https://github.com/GX-XinGao/GRA.
中文摘要:GRA框架通过模拟同行评审流程,使多个专业化的小型语言模型协作生成高质量数据,在保持高效可持续的同时,实现了与大型模型相当的性能表现。
English Summary: The GRA framework enables multiple specialized small language models to collaboratively generate high-quality data through a peer-review-inspired process, achieving performance comparable to large models while being more efficient and sustainable.
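A skeleton of the Generator-Reviewer-Adjudicator loop, with all three roles passed in as callables. The verdict format and round limit are illustrative assumptions, not the released interface.
```python
def gra_synthesize(generator, reviewers, adjudicator, seed, rounds=3):
    # Generator proposes; each Reviewer returns (accept, critique);
    # the Adjudicator resolves split decisions.
    sample = generator(seed)
    for _ in range(rounds):
        verdicts = [review(sample) for review in reviewers]
        if all(ok for ok, _ in verdicts):
            return sample                               # unanimous accept
        if any(ok for ok, _ in verdicts):
            sample = adjudicator(sample, verdicts)      # split decision
        else:
            sample = generator(f"{seed}\nFix: {verdicts[0][1]}")  # regenerate
    return sample
```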

Authors:Siyan Zhao, Devaansh Gupta, Qinqing Zheng, Aditya Grover
Title: d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Abstract:
Recent large language models (LLMs) have demonstrated strong reasoning capabilities that benefit from online reinforcement learning (RL). These capabilities have primarily been demonstrated within the left-to-right autoregressive (AR) generation paradigm. In contrast, non-autoregressive paradigms based on diffusion generate text in a coarse-to-fine manner. Although recent diffusion-based large language models (dLLMs) have achieved competitive language modeling performance compared to their AR counterparts, it remains unclear if dLLMs can also leverage recent advances in LLM reasoning. To this end, we propose d1, a framework to adapt pre-trained masked dLLMs into reasoning models via a combination of supervised finetuning (SFT) and RL. Specifically, we develop and extend techniques to improve reasoning in pretrained dLLMs: (a) we utilize a masked SFT technique to distill knowledge and instill self-improvement behavior directly from existing datasets, and (b) we introduce a novel critic-free, policy-gradient based RL algorithm called diffu-GRPO, the first integration of policy gradient methods to masked dLLMs. Through empirical studies, we investigate the performance of different post-training recipes on multiple mathematical and planning benchmarks. We find that d1 yields the best performance and significantly improves performance of a state-of-the-art dLLM. Our code is released at https://dllm-reasoning.github.io/.
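diffu-GRPO's masked-diffusion specifics are beyond a few lines, but the critic-free core of any GRPO-style method is the group-relative advantage: sample several completions per prompt and normalize their rewards within the group. A minimal sketch of that step, assuming scalar rewards (the rest of the update is standard policy-gradient machinery):
```python
import torch

def group_relative_advantages(rewards):
    # rewards: one scalar per sampled completion of the same prompt,
    # e.g. 0/1 correctness on a math problem.
    r = torch.tensor(rewards, dtype=torch.float)
    return (r - r.mean()) / (r.std() + 1e-6)

print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```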

Authors:Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa
Title: LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Abstract:
Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human judgments improves significantly, from 0.22 (EM) and 0.40 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of the QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.
中文: 本研究证明,使用大型语言模型作为阅读理解问答模型的评估工具,能显著提升与人类判断的相关性,优于传统的精确匹配和F1分数指标。
English: This study demonstrates that using large language models as judges for evaluating reading comprehension QA models significantly improves correlation with human judgments, outperforming traditional metrics like EM and F1-score.

Authors:Bingjie Gao, Xinyu Gao, Xiaoxue Wu, Yujie Zhou, Yu Qiao, Li Niu, Xinyuan Chen, Yaohui Wang
Title: The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
Abstract:
The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework that addresses potential inaccuracies and ambiguous details in LLM-generated prompts. RAPO refines naive prompts through dual optimization branches, selecting the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. In parallel, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts.
中文: RAPO是一种新颖的检索增强提示优化框架,通过双重优化分支改进用户提示,使其与训练提示分布对齐,从而提升生成视频的静态与动态质量。
English: RAPO is a novel retrieval-augmented prompt optimization framework that refines user prompts through dual optimization branches to enhance video generation quality by aligning them with training prompt distributions.
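The dual-branch structure is easy to see in code. In this sketch, `refine`, `rewrite`, and `select` stand in for the fine-tuned LLM, the pre-trained instruction-following LLM, and the prompt selector; the modifier lookup from the learned relational graph is reduced to a plain list. All names are illustrative.
```python
def rapo(user_prompt, modifiers, refine, rewrite, select):
    # Branch 1: augment with graph-derived modifiers, then refine the
    # result toward the training-prompt format.
    branch_a = refine(user_prompt + ", " + ", ".join(modifiers))
    # Branch 2: rewrite the naive prompt under a fixed instruction set.
    branch_b = rewrite(user_prompt)
    # Pick the superior prompt for T2V generation.
    return select(branch_a, branch_b)
```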

Authors:Haokun Liu, Sicong Huang, Jingyu Hu, Yangqiaoyu Zhou, Chenhao Tan
Title: HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation
Abstract:
There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address this, we introduce HypoBench, a novel benchmark designed to evaluate LLMs and hypothesis generation methods across multiple aspects, including practical utility, generalizability, and hypothesis discovery rate. HypoBench includes 7 real-world tasks and 5 synthetic tasks with 194 distinct datasets. We evaluate four state-of-the-art LLMs combined with six existing hypothesis-generation methods. Overall, our results suggest that existing methods are capable of discovering valid and novel patterns in the data. However, the results from synthetic datasets indicate that there is still significant room for improvement, as current hypothesis generation methods do not fully uncover all relevant or meaningful patterns. Specifically, in synthetic settings, as task difficulty increases, performance significantly drops, with the best models and methods recovering only 38.8% of the ground-truth hypotheses. These findings highlight challenges in hypothesis generation and demonstrate that HypoBench serves as a valuable resource for improving AI systems designed to assist scientific discovery.

Authors:Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan
Title: TextArena
Abstract:
TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player setups) and allows for easy evaluation of model capabilities via an online-play system (against humans and other submitted models) with real-time TrueSkill scores. Traditional benchmarks rarely assess dynamic social skills such as negotiation, theory of mind, and deception, creating a gap that TextArena addresses. Designed with research, community and extensibility in mind, TextArena emphasizes ease of adding new games, adapting the framework, testing models, playing against the models, and training models. Detailed documentation of environments, games, leaderboard, and examples are available on https://github.com/LeonGuertler/TextArena and https://www.textarena.ai/.
中文: TextArena是一个开源的文本竞技游戏集合,包含57+种独特环境,专门用于训练和评估大语言模型的动态社交能力(如谈判与欺骗),通过可扩展框架和实时评分系统弥补传统基准测试的不足。
English: TextArena is an open-source platform featuring over 57 competitive text-based games designed to train and evaluate social skills like negotiation and deception in LLMs, addressing gaps in traditional benchmarks through its extensible framework and real-time scoring system.
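Online ratings like TextArena's can be maintained with the open-source `trueskill` package. A toy update loop; the model names are invented, and TextArena's own rating plumbing may differ.
```python
import trueskill  # pip install trueskill

ratings = {m: trueskill.Rating() for m in ["model-a", "model-b", "model-c"]}

def record(winner, loser):
    # Update both ratings after one two-player game.
    ratings[winner], ratings[loser] = trueskill.rate_1vs1(
        ratings[winner], ratings[loser])

record("model-a", "model-b")
leaderboard = sorted(ratings, key=lambda m: ratings[m].mu, reverse=True)
print(leaderboard)
```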

Authors:Xue Zhang, Songming Zhang, Yunlong Liang, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Title: A Dual-Space Framework for General Knowledge Distillation of Large Language Models
Abstract:
Knowledge distillation (KD) is a promising solution to compress large language models (LLMs) by transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the teacher model and the student model to transfer more information. However, we reveal that the current white-box KD framework exhibits two limitations: a) bridging probability distributions from different output spaces will limit the similarity between the teacher model and the student model; b) this framework cannot be applied to LLMs with different vocabularies. One of the root causes for these limitations is that the distributions from the teacher and the student for KD are output by different prediction heads, which yield distributions in different output spaces and dimensions. Therefore, in this paper, we propose a dual-space knowledge distillation (DSKD) framework that unifies the prediction heads of the teacher and the student models for KD. Specifically, we first introduce two projectors with ideal initialization to project the teacher/student hidden states into the student/teacher representation spaces. After this, the hidden states from different models can share the same head and unify the output spaces of the distributions. Furthermore, we develop an exact token alignment (ETA) algorithm to align the same tokens in two differently-tokenized sequences. Based on the above, our DSKD framework is a general KD framework that supports both off-policy and on-policy KD, and KD between any two LLMs regardless of their vocabularies. Extensive experiments on instruction-following, mathematical reasoning, and code generation benchmarks show that DSKD significantly outperforms existing methods based on the current white-box KD framework and surpasses other cross-tokenizer KD methods for LLMs with different vocabularies.
Chinese: 本文提出了一种双空间知识蒸馏(DSKD)框架,通过投影隐藏状态和对齐标记来统一师生模型的输出空间,实现了不同词汇表大语言模型间的有效知识迁移,并在多个基准测试中显著优于现有方法。
English: This paper introduces a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of teacher and student models by projecting hidden states and aligning tokens, enabling effective distillation between large language models with different vocabularies and outperforming existing methods.
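The dual-space idea reduces to projecting one model's hidden states into the other's representation space so that both distributions come from the same prediction head. A toy sketch of the student-to-teacher direction; the dimensions and the single linear projector are illustrative, and the paper additionally handles the reverse direction, projector initialization, and exact token alignment.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_s, d_t, vocab_t = 512, 1024, 32000             # toy dimensions
proj = nn.Linear(d_s, d_t)                        # student -> teacher space
teacher_head = nn.Linear(d_t, vocab_t)            # shared prediction head

h_student = torch.randn(8, d_s)                   # 8 aligned token positions
h_teacher = torch.randn(8, d_t)

# Both distributions now live in the teacher's output space.
p_teacher = F.softmax(teacher_head(h_teacher), dim=-1)
log_p_student = F.log_softmax(teacher_head(proj(h_student)), dim=-1)
kd_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```
Because the KL divergence is computed under a single head, the vocabulary-mismatch problem disappears by construction, which is what makes cross-tokenizer distillation possible once tokens are aligned.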

Authors:Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor
Title: Offline Learning and Forgetting for Reasoning with Large Language Models
Abstract:
Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed (forgetting) reasoning paths derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown reasoning benchmarks show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180x. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
Chinese: 该方法通过利用搜索生成的成功与失败路径对大型语言模型进行微调,在显著提升推理成功率的同时,大幅降低了推理时间,优于传统搜索方法。
English: The proposed method enhances reasoning in large language models by fine-tuning them with search-generated successful and failed paths, significantly improving success rates and reducing inference time compared to traditional search-based approaches.
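A sketch of a learning-and-forgetting objective of this shape: standard cross-entropy on successful paths plus an unlikelihood-style term on failed ones. The exact loss in the paper may differ; shapes here are (batch, seq, vocab) for logits and (batch, seq) for labels.
```python
import torch
import torch.nn.functional as F

def learn_forget_loss(pos_logits, pos_labels, neg_logits, neg_labels, beta=1.0):
    # Learn: maximize likelihood of tokens on successful search paths.
    learn = F.cross_entropy(pos_logits.flatten(0, 1), pos_labels.flatten())
    # Forget: push down the probability of tokens on failed paths.
    p_neg = F.softmax(neg_logits, dim=-1).gather(
        -1, neg_labels.unsqueeze(-1)).squeeze(-1)
    forget = -torch.log(1.0 - p_neg + 1e-6).mean()
    return learn + beta * forget
```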

Authors:Xinyi Liu, Xiaoyi Zhang, Ziyun Zhang, Yan Lu
Title: UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
Abstract:
Recent advancements in Large Vision-Language Models are accelerating the development of Graphical User Interface (GUI) agents that utilize human-like vision perception capabilities to enhance productivity on digital devices. Compared to approaches predicated on GUI metadata, which are platform-dependent and vulnerable to implementation variations, vision-based approaches offer broader applicability. In this vision-based paradigm, GUI instruction grounding, which maps a user instruction to the location of the corresponding element on a given screenshot, remains a critical challenge, particularly due to limited public training datasets and resource-intensive manual instruction annotation. In this paper, we delve into unexplored challenges in this task, including element-to-screen ratio, unbalanced element types, and implicit instructions. To address these challenges, we introduce a large-scale data synthesis pipeline, UI-E2I-Synth, for generating instruction datasets of varying complexity using GPT-4o instead of human annotators. Furthermore, we propose a new GUI instruction grounding benchmark, UI-I2E-Bench, which is designed to address the limitations of existing benchmarks by incorporating diverse annotation aspects. Our model, trained on the synthesized data, achieves superior performance in GUI instruction grounding, demonstrating the effectiveness of the proposed data synthesis pipeline. The proposed benchmark, accompanied by extensive analyses, provides practical insights for future research in GUI grounding. We will release the corresponding artifacts at https://microsoft.github.io/FIVE-UI-Evol/.

Authors:René Peinl
Title: Using LLMs as prompt modifier to avoid biases in AI image generators
Abstract:
This study examines how Large Language Models (LLMs) can reduce biases in text-to-image generation systems by modifying user prompts. We define bias as a model's unfair deviation from population statistics given neutral prompts. Our experiments with Stable Diffusion XL, 3.5 and Flux demonstrate that LLM-modified prompts significantly increase image diversity and reduce bias without the need to change the image generators themselves. While occasionally producing results that diverge from original user intent for elaborate prompts, this approach generally provides more varied interpretations of underspecified requests rather than superficial variations. The method works particularly well for less advanced image generators, though limitations persist for certain contexts like disability representation. All prompts and generated images are available at https://iisys-hof.github.io/llm-prompt-img-gen/

Authors:Sukannya Purkayastha, Zhuang Li, Anne Lauscher, Lizhen Qu, Iryna Gurevych
Title: LazyReview: A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews
Abstract:
Peer review is a cornerstone of quality control in scientific publishing. With the increasing workload, the unintended use of 'quick' heuristics, referred to as lazy thinking, has emerged as a recurring issue compromising review quality. Automated methods to detect such heuristics can help improve the peer-reviewing process. However, there is limited NLP research on this issue, and no real-world dataset exists to support the development of detection tools. This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Our analysis reveals that Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. However, instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 performance points, highlighting the importance of high-quality training data. Furthermore, a controlled experiment demonstrates that reviews revised with lazy thinking feedback are more comprehensive and actionable than those written without such feedback. We will release our dataset and the enhanced guidelines that can be used to train junior reviewers in the community. (Code available here: https://github.com/UKPLab/acl2025-lazy-review)
中文: 本研究推出了LazyReview数据集用于检测同行评审中的惰性思维,证明经过微调的大语言模型能显著提升检测效果,且基于此类分析的反馈能有效提高评审质量。
English: This study introduces LazyReview, a dataset for detecting lazy thinking in peer reviews, showing that fine-tuned LLMs significantly improve detection and that feedback based on such analysis enhances review quality.

Authors:Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du
Title: Dynamic Compressing Prompts for Efficient Inference of Large Language Models
Abstract:
Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at https://github.com/Fhujinwu/DCP.
中文: 本文提出动态压缩提示(LLM-DCP)方法,通过将提示压缩建模为马尔可夫决策过程,结合奖励函数和分层训练策略,在保持性能的同时逐步去除冗余标记,尤其在较高压缩率下优于现有技术。
English: This paper introduces Dynamic Compressing Prompts (LLM-DCP), a task-agnostic method that models prompt compression as a Markov Decision Process to sequentially remove redundant tokens while preserving performance through a reward function and hierarchical training strategy, outperforming existing techniques especially at high compression rates.
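The MDP view is: state = current prompt, action = remove one token, reward = a balance of compression rate, output quality, and key-information retention. A greedy (non-RL) stand-in for the learned policy, with the `redundancy` scorer passed in as an assumption:
```python
def compress(tokens, redundancy, budget):
    # Sequentially drop the currently most redundant token until the token
    # budget is met; the paper instead learns this removal policy with RL.
    tokens = list(tokens)
    while len(tokens) > budget:
        i = max(range(len(tokens)), key=lambda j: redundancy(tokens, j))
        tokens.pop(i)
    return tokens

# Toy run: treat filler words as redundant.
FILLER = {"please", "kindly", "very"}
print(compress("please summarize this very long report kindly".split(),
               lambda ts, j: ts[j] in FILLER, budget=5))
```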

Authors:Changjiang Gao, Hankun Lin, Shujian Huang, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Jiajun Chen
Title: Understanding LLMs' Cross-Lingual Context Retrieval: How Good It Is And Where It Comes From
Abstract:
The ability of cross-lingual context retrieval is a fundamental aspect of cross-lingual alignment of large language models (LLMs), where the model extracts context information in one language based on requests in another language. Despite its importance in real-life applications, this ability has not been adequately investigated for state-of-the-art models. In this paper, we evaluate the cross-lingual context retrieval ability of over 40 LLMs across 12 languages to understand the source of this ability, using cross-lingual machine reading comprehension (xMRC) as a representative scenario. Our results show that several small, post-trained open LLMs show strong cross-lingual context retrieval ability, comparable to closed-source LLMs such as GPT-4o, and their estimated oracle performances greatly improve after post-training. Our interpretability analysis shows that the cross-lingual context retrieval process can be divided into two main phases: question encoding and answer retrieval, which are formed in pre-training and post-training, respectively. The phasing stability correlates with xMRC performance, and the xMRC bottleneck lies at the last model layers in the second phase, where the effect of post-training can be evidently observed. Our results also indicate that larger-scale pretraining cannot improve the xMRC performance. Instead, larger LLMs need further multilingual post-training to fully unlock their cross-lingual context retrieval potential. Our code is available at https://github.com/NJUNLP/Cross-Lingual-Context-Retrieval
中文: 本研究评估了40多个大语言模型在12种语言中的表现,发现经过后训练的小型开源模型在跨语言上下文检索能力上可媲美GPT-4o,其性能依赖于预训练阶段形成的分层处理机制,且需要通过多语言后训练而非扩大预训练规模来充分释放潜力。
English: This study evaluates over 40 large language models across 12 languages, revealing that small post-trained open models match GPT-4o's cross-lingual context retrieval ability, with performance relying on phased processes formed during pre-training and enhanced through multilingual post-training rather than larger pretraining scales.

Authors:Yize Zhang, Tianshu Wang, Sirui Chen, Kun Wang, Xingyu Zeng, Hongyu Lin, Xianpei Han, Le Sun, Chaochao Lu
Title: ARise: Towards Knowledge-Augmented Reasoning via Risk-Adaptive Search
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities and are receiving increasing attention to enhance their reasoning through scaling test-time compute. However, their application in open-ended, knowledge-intensive, complex reasoning scenarios is still limited. Reasoning-oriented methods struggle to generalize to open-ended scenarios due to implicit assumptions of complete world knowledge. Meanwhile, knowledge-augmented reasoning (KAR) methods fail to address two core challenges: 1) error propagation, where errors in early steps cascade through the chain, and 2) verification bottleneck, where the explore-exploit tradeoff arises in multi-branch decision processes. To overcome these limitations, we introduce ARise, a novel framework that integrates risk assessment of intermediate reasoning states with dynamic retrieval-augmented generation (RAG) within a Monte Carlo tree search paradigm. This approach enables effective construction and optimization of reasoning plans across multiple maintained hypothesis branches. Experimental results show that ARise significantly outperforms the state-of-the-art KAR methods by up to 23.10%, and the latest RAG-equipped large reasoning models by up to 25.37%. Our project page is at https://opencausalab.github.io/ARise.
Chinese: ARise是一种新颖框架,通过将风险评估和动态检索增强生成结合到蒙特卡洛树搜索中,显著提升了大语言模型的推理能力,其性能比现有最优方法高出最多25.37%。
English: ARise is a novel framework that enhances reasoning in large language models by integrating risk assessment and dynamic retrieval-augmented generation within a Monte Carlo tree search, significantly outperforming existing methods by up to 25.37%.
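One way to picture risk-adaptive search is UCT selection with a risk penalty on each candidate reasoning state. This sketch is my reading of the idea, not the released implementation; the `risk` field would come from the framework's risk-assessment model.
```python
import math

def uct_select(children, c=1.4):
    # children: dicts with visit counts, accumulated value, and an
    # externally estimated risk for each candidate reasoning state.
    total = sum(ch["visits"] for ch in children) + 1
    def score(ch):
        exploit = ch["value"] / max(ch["visits"], 1)
        explore = c * math.sqrt(math.log(total) / max(ch["visits"], 1))
        return exploit + explore - ch["risk"]   # risk-adaptive penalty
    return max(children, key=score)

best = uct_select([{"visits": 3, "value": 2.0, "risk": 0.1},
                   {"visits": 1, "value": 0.5, "risk": 0.6}])
```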

Authors:Jessica Lin, Amir Zeldes
Title: GUM-SAGE: A Novel Dataset and Approach for Graded Entity Salience Prediction
Abstract:
Determining and ranking the most salient entities in a text is critical for user-facing systems, especially as users increasingly rely on models to interpret long documents they only partially read. Graded entity salience addresses this need by assigning entities scores that reflect their relative importance in a text. Existing approaches fall into two main categories: subjective judgments of salience, which allow for gradient scoring but lack consistency, and summarization-based methods, which define salience as mention-worthiness in a summary, promoting explainability but limiting outputs to binary labels (entities are either summary-worthy or not). In this paper, we introduce a novel approach for graded entity salience that combines the strengths of both approaches. Using an English dataset spanning 12 spoken and written genres, we collect 5 summaries per document and calculate each entity's salience score based on its presence across these summaries. Our approach shows stronger correlation with scores based on human summaries and alignments, and outperforms existing techniques, including LLMs. We release our data and code at https://github.com/jl908069/gum_sum_salience to support further research on graded salient entity extraction.
中文摘要:本文提出了一种新颖的实体显著性分级方法,通过结合主观评分和基于摘要的方法,利用多篇摘要计算实体重要性,在多种文本类型中展现出优于现有技术的性能表现。
English Summary: This paper introduces a novel graded entity salience method that combines subjective scoring and summarization-based approaches, achieving superior performance over existing techniques by calculating entity importance through multiple summaries across diverse genres.
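The scoring rule itself is simple: an entity's salience is the fraction of independently written summaries that mention it. A sketch in which plain substring matching stands in for the paper's entity alignment step:
```python
def salience_scores(entities, summaries):
    # Graded salience = fraction of summaries mentioning the entity
    # (the paper uses 5 summaries per document, with proper alignment).
    return {e: sum(e.lower() in s.lower() for s in summaries) / len(summaries)
            for e in entities}

print(salience_scores(["Ada Lovelace", "London"],
                      ["Ada Lovelace pioneered computing...",
                       "The note describes Lovelace's algorithm..."]))
```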

Authors:Ankit Kumar Shaw, Kun Jiang, Tuopu Wen, Chandan Kumar Sah, Yining Shi, Mengmeng Yang, Diange Yang, Xiaoli Lian
Title: CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates
Abstract:
The rapid growth of intelligent connected vehicles (ICVs) and integrated vehicle-road-cloud systems has increased the demand for accurate, real-time HD map updates. However, ensuring map reliability remains challenging due to inconsistencies in crowdsourced data, which suffer from motion blur, lighting variations, adverse weather, and lane marking degradation. This paper introduces CleanMAP, a Multimodal Large Language Model (MLLM)-based distillation framework designed to filter and refine crowdsourced data for high-confidence HD map updates. CleanMAP leverages an MLLM-driven lane visibility scoring model that systematically quantifies key visual parameters, assigning confidence scores (0-10) based on their impact on lane detection. A novel dynamic piecewise confidence-scoring function adapts scores based on lane visibility, ensuring strong alignment with human evaluations while effectively filtering unreliable data. To further optimize map accuracy, a confidence-driven local map fusion strategy ranks and selects the top-k highest-scoring local maps within an optimal confidence range (best score minus 10%), striking a balance between data quality and quantity. Experimental evaluations on a real-world autonomous vehicle dataset validate CleanMAP's effectiveness, demonstrating that fusing the top three local maps achieves the lowest mean map update error of 0.28m, outperforming the baseline (0.37m) and meeting stringent accuracy thresholds (<= 0.32m). Further validation with real-vehicle data confirms 84.88% alignment with human evaluators, reinforcing the model's robustness and reliability. This work establishes CleanMAP as a scalable and deployable solution for crowdsourced HD map updates, ensuring more precise and reliable autonomous navigation. The code will be available at https://Ankit-Zefan.github.io/CleanMap/
中文摘要:CleanMAP是一个基于多模态大语言模型的蒸馏框架,通过评估车道可见性并融合高质量局部地图来优化众包高清地图更新,在自动驾驶导航中实现了卓越的精度和与人工评估的高度一致性。
English Summary: CleanMAP is a multimodal large language model-based framework that refines crowdsourced data for high-definition map updates by scoring lane visibility and fusing top-quality local maps, achieving superior accuracy and human alignment in autonomous navigation.
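The fusion rule can be sketched directly. I read "best score minus 10%" as a relative margin below the best confidence, which is an interpretation rather than the released code.
```python
def select_local_maps(scored_maps, k=3, margin=0.10):
    # scored_maps: (map_id, confidence in [0, 10]) pairs from the
    # MLLM-driven lane-visibility scorer.
    best = max(score for _, score in scored_maps)
    eligible = [(m, s) for m, s in scored_maps if s >= best * (1 - margin)]
    return sorted(eligible, key=lambda ms: ms[1], reverse=True)[:k]

# The paper's best result comes from fusing the top three local maps.
print(select_local_maps([("m1", 9.4), ("m2", 9.1), ("m3", 8.7), ("m4", 6.0)]))
```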

Authors:Nafis Sadeq, Xin Xu, Zhouhang Xie, Julian McAuley, Byungkyu Kang, Prarit Lamba, Xiang Gao
Title: Improving In-Context Learning with Reasoning Distillation
Abstract:
Language models rely on semantic priors to perform in-context learning, which leads to poor performance on tasks involving inductive reasoning. Instruction-tuning methods based on imitation learning can superficially enhance the in-context learning performance of language models, but they often fail to improve the model's understanding of the underlying rules that connect inputs and outputs in few-shot demonstrations. We propose ReDis, a reasoning distillation technique designed to improve the inductive reasoning capabilities of language models. Through a careful combination of data augmentation, filtering, supervised fine-tuning, and alignment, ReDis achieves significant performance improvements across a diverse range of tasks, including 1D-ARC, List Function, ACRE, and MiniSCAN. Experiments on three language model backbones show that ReDis outperforms equivalent few-shot prompting baselines across all tasks and even surpasses the teacher model, GPT-4o, in some cases. ReDis, based on the LLaMA-3 backbone, achieves relative improvements of 23.2%, 2.8%, and 66.6% over GPT-4o on 1D-ARC, ACRE, and MiniSCAN, respectively, within a similar hypothesis search space. The code, dataset, and model checkpoints will be made available at https://github.com/NafisSadeq/reasoning-distillation.git.
中文: ReDis是一种推理蒸馏技术,通过数据增强和微调提升语言模型的归纳推理能力,在多项任务中表现卓越,部分情况下甚至超越了GPT-4o。
English: ReDis is a reasoning distillation technique that enhances language models' inductive reasoning through data augmentation and fine-tuning, achieving superior performance across multiple tasks and even surpassing GPT-4o in some cases.
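The data side of such a pipeline is mostly generate-then-filter. A sketch with the teacher call and the answer verifier passed in as callables; both interfaces are assumptions for illustration, not the released code.
```python
def build_distillation_set(teacher, tasks, verify):
    # Generate reasoning traces with the teacher and keep only those whose
    # final answer verifies -- the filtering step of a ReDis-style pipeline.
    kept = []
    for prompt, gold in tasks:
        trace = teacher(prompt)
        if verify(trace, gold):
            kept.append({"prompt": prompt, "completion": trace})
    return kept
```
The kept pairs then feed supervised fine-tuning and alignment, which is where the student can end up matching or exceeding the teacher on the target tasks.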

Authors:Zhe Wang, Fangtian Fu, Wei Zhang, Lige Yan, Yan Meng, Jianping Wu, Hui Wu, Gang Xu, Si Chen
Title: BioChemInsight: An Open-Source Toolkit for Automated Identification and Recognition of Optical Chemical Structures and Activity Data in Scientific Publications
Abstract:
Automated extraction of chemical structures and their bioactivity data is crucial for accelerating drug discovery and enabling data-driven pharmaceutical research. Existing optical chemical structure recognition (OCSR) tools fail to autonomously associate molecular structures with their bioactivity profiles, creating a critical bottleneck in structure-activity relationship (SAR) analysis. Here, we present BioChemInsight, an open-source pipeline that integrates: (1) DECIMER Segmentation and MolVec for chemical structure recognition, (2) Qwen2.5-VL-32B for compound identifier association, and (3) PaddleOCR with Gemini-2.0-flash for bioactivity extraction and unit normalization. We evaluated the performance of BioChemInsight on 25 patents and 17 articles. BioChemInsight achieved 95% accuracy for tabular patent data (structure/identifier recognition), with lower accuracy in non-tabular patents (~80% structures, ~75% identifiers), plus 92.2% bioactivity extraction accuracy. For articles, it attained >99% identifiers and 78-80% structure accuracy in non-tabular formats, plus 97.4% bioactivity extraction accuracy. The system generates ready-to-use SAR datasets, reducing data preprocessing time from weeks to hours while enabling applications in high-throughput screening and ML-driven drug design (https://github.com/dahuilangda/BioChemInsight).
中文:BioChemInsight是一个开源流程,能够自动从专利和文献中提取化学结构及其生物活性数据,准确率高,通过生成可直接使用的构效关系数据集,大幅加速药物发现进程。
English: BioChemInsight is an open-source pipeline that automates the extraction of chemical structures and their bioactivity data from patents and articles, achieving high accuracy and significantly accelerating drug discovery by generating ready-to-use SAR datasets.

Authors:Yijun Liang, Ming Li, Chenrui Fan, Ziyue Li, Dang Nguyen, Kwesi Cobbina, Shweta Bhardwaj, Jiuhai Chen, Fuxiao Liu, Tianyi Zhou
Title: ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Abstract:
Color plays an important role in human perception and usually provides critical clues in visual reasoning. However, it is unclear whether and how vision-language models (VLMs) can perceive, understand, and leverage color as humans. This paper introduces ColorBench, an innovative benchmark meticulously crafted to assess the capabilities of VLMs in color understanding, including color perception, reasoning, and robustness. By curating a suite of diverse test scenarios, with grounding in real applications, ColorBench evaluates how these models perceive colors, infer meanings from color-based cues, and maintain consistent performance under varying color transformations. Through an extensive evaluation of 32 VLMs with varying language models and vision encoders, our paper reveals some undiscovered findings: (i) The scaling law (larger models are better) still holds on ColorBench, while the language model plays a more important role than the vision encoder. (ii) However, the performance gaps across models are relatively small, indicating that color understanding has been largely neglected by existing VLMs. (iii) CoT reasoning improves color understanding accuracies and robustness, even though these are vision-centric tasks. (iv) Color clues are indeed leveraged by VLMs on ColorBench, but they can also mislead models in some tasks. These findings highlight the critical limitations of current VLMs and underscore the need to enhance color comprehension. Our ColorBench can serve as a foundational tool for advancing the study of human-level color understanding of multimodal AI.
中文摘要:ColorBench是一个评估视觉语言模型颜色理解能力的新基准,揭示了现有模型在颜色感知、推理和鲁棒性方面存在显著不足,尽管模型规模扩大和思维链推理能带来一定提升。
English Summary: ColorBench is a novel benchmark that evaluates vision-language models' color understanding, revealing significant limitations in their ability to perceive, reason with, and maintain robustness regarding colors, despite scaling laws and CoT reasoning offering some improvements.

Authors:Ning Li, Jingran Zhang, Justin Cui
Title: ArXivBench: When You Should Avoid Using ChatGPT for Academic Writing
Abstract:
Large language models (LLMs) demonstrate strong capabilities in reasoning and question answering, yet their tendency to generate factually incorrect content remains a critical challenge. This study evaluates proprietary and open-source LLMs on generating relevant research papers with accurate arXiv links. Our evaluation reveals critical academic risks: LLMs frequently generate incorrect arXiv links or references to non-existent papers, fundamentally undermining their ability to properly attribute research contributions to the actual authors. We introduce arXivBench, a benchmark specifically designed to assess LLM performance across eight major subject categories on arXiv and five subfields within computer science, one of the most popular categories among them. Our findings show concerning accuracy variations across subjects, with Claude-3.5-Sonnet exhibiting a substantial advantage in generating both relevant and accurate responses. Notably, most LLMs perform significantly better in Artificial Intelligence than other subfields. This benchmark provides a standardized tool for evaluating LLM reliability in scientific contexts, promoting more dependable academic use in research environments. Our code and dataset are available at https://github.com/liningresearch/arXivBench and https://huggingface.co/datasets/arXivBenchLLM/arXivBench.
中文: 本研究推出arXivBench基准测试,揭示大语言模型常生成错误arXiv链接和虚假参考文献,其中Claude-3.5-Sonnet在人工智能领域表现最佳,凸显了其在学术应用中的严重可靠性问题。
English: This study introduces arXivBench, a benchmark revealing that large language models often produce inaccurate arXiv links and references, with Claude-3.5-Sonnet showing superior accuracy particularly in Artificial Intelligence, highlighting critical reliability concerns in academic applications.
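
As a rough illustration of the kind of check arXivBench implies, the sketch below validates a model-generated arXiv link by pattern-matching new-style IDs and confirming the page resolves; it is not the benchmark's actual scoring code.

```python
# A minimal link-validity check, not arXivBench's scoring pipeline.
import re
import urllib.error
import urllib.request

def arxiv_link_exists(url: str, timeout: float = 10.0) -> bool:
    """True if the URL is a new-style arXiv abstract link that resolves."""
    # Covers new-style IDs (YYMM.NNNNN) only; old-style IDs (cs/9901001) would need more.
    if not re.match(r"https?://arxiv\.org/abs/\d{4}\.\d{4,5}(v\d+)?$", url):
        return False
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.URLError:
        return False  # e.g. 404 for a hallucinated paper ID, or a network failure

print(arxiv_link_exists("https://arxiv.org/abs/2506.24019"))
```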

Authors:Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff
Title: MIEB: Massive Image Embedding Benchmark
Abstract:
Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
中文摘要:我们推出了大规模图像嵌入基准(MIEB),通过涵盖38种语言的130项任务全面评估图像及图文嵌入模型,发现没有单一模型能在所有类别中表现卓越,同时揭示了先进视觉模型在文本视觉表征方面的优势及其在混合编码和干扰环境下图文匹配的局限性。
English Summary: The Massive Image Embedding Benchmark (MIEB) is introduced to comprehensively evaluate image and image-text embedding models across 130 tasks in 38 languages, revealing that no single model excels in all categories while uncovering both strengths and limitations in advanced vision models.

Authors:Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu
Title: RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users
Abstract:
To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following in the real world poses significant challenges beyond performing a single, clearly defined task. For instance, real-world human instructions can be ambiguous, require different levels of AI assistance, and may evolve over time, reflecting changes in the user's mental state. To address this gap, we introduce RealWebAssist, a novel benchmark designed to evaluate sequential instruction-following in realistic scenarios involving long-horizon interactions with the web, visual GUI grounding, and understanding ambiguous real-world user instructions. RealWebAssist includes a dataset of sequential instructions collected from real-world human users. Each user instructs a web-based assistant to perform a series of tasks on multiple websites. A successful agent must reason about the true intent behind each instruction, keep track of the mental state of the user, understand user-specific routines, and ground the intended tasks to actions on the correct GUI elements. Our experimental results show that state-of-the-art models struggle to understand and ground user instructions, posing critical challenges in following real-world user instructions for long-horizon web assistance.
中文: RealWebAssist 是一个新颖的基准测试,旨在评估AI代理处理现实世界中连续、模糊用户指令的能力,当前模型在意图推理和图形界面定位方面仍面临显著挑战。
English: RealWebAssist is a new benchmark designed to evaluate AI agents' ability to handle sequential, ambiguous real-world user instructions for long-horizon web tasks, where current models struggle with intent reasoning and GUI grounding.

Authors:Haoran Hao, Jiaming Han, Yiyuan Zhang, Xiangyu Yue
Title: Multimodal Long Video Modeling Based on Temporal Dynamic Context
Abstract:
Recent advances in Large Language Models (LLMs) have led to significant breakthroughs in video understanding. However, existing models still struggle with long video processing due to the context length constraint of LLMs and the vast amount of information within the video. Although some recent methods are designed for long video understanding, they often lose crucial information during token compression and struggle with additional modalities such as audio. In this work, we propose a dynamic long video encoding method utilizing the temporal relationship between frames, named Temporal Dynamic Context (TDC). Firstly, we segment the video into semantically consistent scenes based on inter-frame similarities, then encode each frame into tokens using visual-audio encoders. Secondly, we propose a novel temporal context compressor to reduce the number of tokens within each segment. Specifically, we employ a query-based Transformer to aggregate video, audio, and instruction text tokens into a limited set of temporal context tokens. Finally, we feed the static frame tokens and the temporal context tokens into the LLM for video understanding. Furthermore, to handle extremely long videos, we propose a training-free chain-of-thought strategy that progressively extracts answers from multiple video segments. These intermediate answers serve as part of the reasoning process and contribute to the final answer. We conduct extensive experiments on general video understanding and audio-video understanding benchmarks, where our method demonstrates strong performance. The code and models are available at https://github.com/Hoar012/TDC-Video.
中文: 本文提出时序动态上下文(TDC)方法,通过将视频分割为语义场景、使用时序上下文压缩器减少标记数量,并采用思维链策略处理超长视频,在视频与音频理解基准测试中表现优异。
English: This paper introduces Temporal Dynamic Context (TDC), a dynamic long video encoding method that segments videos into scenes, compresses tokens using a temporal context compressor, and employs a chain-of-thought strategy for enhanced video and audio understanding, achieving strong performance on benchmarks.
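
The first step (scene segmentation by inter-frame similarity) can be illustrated with a small sketch. It assumes frames are already embedded by a visual encoder; the 0.8 threshold is an illustrative choice, not the paper's value.

```python
# A minimal sketch of similarity-based scene segmentation.
import numpy as np

def segment_scenes(frame_embs: np.ndarray, threshold: float = 0.8) -> list[list[int]]:
    """Group frame indices into scenes by cosine similarity of neighbors."""
    normed = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    scenes, current = [], [0]
    for i in range(1, len(normed)):
        if float(normed[i] @ normed[i - 1]) >= threshold:
            current.append(i)       # frames remain similar: same scene
        else:
            scenes.append(current)  # similarity drop marks a scene boundary
            current = [i]
    scenes.append(current)
    return scenes

embs = np.random.default_rng(0).normal(size=(12, 64))  # synthetic frame embeddings
print([len(s) for s in segment_scenes(embs)])
```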

Authors:Michał Turski, Mateusz Chiliński, Łukasz Borchmann
Title: Unchecked and Overlooked: Addressing the Checkbox Blind Spot in Large Language Models with CheckboxQA
Abstract:
Checkboxes are critical in real-world document processing where the presence or absence of ticks directly informs data extraction and decision-making processes. Yet, despite the strong performance of Large Vision and Language Models across a wide range of tasks, they struggle with interpreting checkable content. This challenge becomes particularly pressing in industries where a single overlooked checkbox may lead to costly regulatory or contractual oversights. To address this gap, we introduce the CheckboxQA dataset, a targeted resource designed to evaluate and improve model performance on checkbox-related tasks. It reveals the limitations of current models and serves as a valuable tool for advancing document comprehension systems, with significant implications for applications in sectors such as legal tech and finance. The dataset is publicly available at: https://github.com/Snowflake-Labs/CheckboxQA
中文: CheckboxQA数据集的推出旨在解决大型视觉与语言模型在复选框识别上的不足,成为提升法律科技和金融等领域文档处理能力的重要工具。
English: The CheckboxQA dataset is introduced to address the limitations of Large Vision and Language Models in interpreting checkboxes, serving as a critical tool for improving document processing in fields like legal tech and finance.

Authors:Parshin Shojaee, Ngoc-Hieu Nguyen, Kazem Meidani, Amir Barati Farimani, Khoa D Doan, Chandan K Reddy
Title: LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models
Abstract:
Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorized forms, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy. These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.
中文摘要:LLM-SRBench是一个新颖的基准测试,通过涵盖四个科学领域的239个挑战性问题来严格评估大语言模型在科学方程发现中的能力,其设计能有效防止记忆效应,结果显示当前最优方法仅达31.5%的准确率,凸显了该领域的研究挑战。
English Summary: LLM-SRBench is a novel benchmark designed to rigorously evaluate LLMs' scientific equation discovery capabilities by preventing memorization through 239 challenging problems across four domains, revealing that current methods achieve only 31.5% accuracy and underscoring the field's difficulties.

Authors:Weiqi Wang, Jiefu Ou, Yangqiu Song, Benjamin Van Durme, Daniel Khashabi
Title: Can LLMs Generate Tabular Summaries of Science Papers? Rethinking the Evaluation Protocol
Abstract:
Literature review tables are essential for summarizing and comparing collections of scientific papers. We explore the task of generating tables that best fulfill a user's informational needs given a collection of scientific papers. Building on recent work (Newman et al., 2024), we extend prior approaches to address real-world complexities through a combination of LLM-based methods and human annotations. Our contributions focus on three key challenges encountered in real-world use: (i) User prompts are often under-specified; (ii) Retrieved candidate papers frequently contain irrelevant content; and (iii) Task evaluation should move beyond shallow text similarity techniques and instead assess the utility of inferred tables for information-seeking tasks (e.g., comparing papers). To support reproducible evaluation, we introduce ARXIV2TABLE, a more realistic and challenging benchmark for this task, along with a novel approach to improve literature review table generation in real-world scenarios. Our extensive experiments on this benchmark show that both open-weight and proprietary LLMs struggle with the task, highlighting its difficulty and the need for further advancements. Our dataset and code are available at https://github.com/JHU-CLSP/arXiv2Table.
中文: 本研究通过结合大语言模型方法与人工标注,解决了用户查询模糊和内容不相关等现实难题,提出了ARXIV2TABLE基准测试,实验表明现有模型在此任务上仍有明显不足。
English: This research advances literature review table generation by addressing real-world challenges like vague user queries and irrelevant content through a novel LLM-based approach and introduces the ARXIV2TABLE benchmark, revealing current models' limitations despite improvements.

Authors:Tianjie Ju, Zhenyu Shao, Bowen Wang, Yujia Chen, Zhuosheng Zhang, Hao Fei, Mong-Li Lee, Wynne Hsu, Sufeng Duan, Gongshen Liu
Title: Probing then Editing Response Personality of Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated promising capabilities to generate responses that simulate consistent personality traits. Despite the major attempts to analyze personality expression through output-based evaluations, little is known about how such traits are internally encoded within LLM parameters. In this paper, we introduce a layer-wise probing framework to systematically investigate the layer-wise capability of LLMs in simulating personality for responding. We conduct probing experiments on 11 open-source LLMs over the PersonalityEdit benchmark and find that LLMs predominantly simulate personality for responding in their middle and upper layers, with instruction-tuned models demonstrating a slightly clearer separation of personality traits. Furthermore, by interpreting the trained probing hyperplane as a layer-wise boundary for each personality category, we propose a layer-wise perturbation method to edit the personality expressed by LLMs during inference. Our results show that even when the prompt explicitly specifies a particular personality, our method can still successfully alter the response personality of LLMs. Interestingly, the difficulty of converting between certain personality traits varies substantially, which aligns with the representational distances in our probing experiments. Finally, we conduct a comprehensive MMLU benchmark evaluation and time overhead analysis, demonstrating that our proposed personality editing method incurs only minimal degradation in general capabilities while maintaining low training costs and acceptable inference latency. Our code is publicly available at https://github.com/universe-sky/probing-then-editing-personality.
中文: 本研究提出分层探测框架,发现大语言模型主要在中间和上层模拟人格特征,并提出一种有效的扰动方法,可在保持模型通用能力的同时编辑人格表达。
English: This study introduces a layer-wise probing framework revealing that large language models primarily simulate personality traits in middle and upper layers, and proposes an effective perturbation method to edit these traits with minimal impact on general capabilities.
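
A minimal sketch of the layer-wise probing idea, run on synthetic hidden states: fit one linear probe per layer, then reuse the probe's hyperplane normal as an editing direction, mirroring the paper's perturbation method. The step size alpha is illustrative.

```python
# Layer-wise linear probing on synthetic hidden states.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_layers, n_samples, d = 4, 200, 32
hidden = rng.normal(size=(n_layers, n_samples, d))  # [layer, sample, dim]
labels = rng.integers(0, 2, size=n_samples)         # binary personality label

probes = []
for layer in range(n_layers):
    clf = LogisticRegression(max_iter=1000).fit(hidden[layer], labels)
    probes.append(clf)
    print(f"layer {layer}: probe accuracy {clf.score(hidden[layer], labels):.2f}")

def edit_toward(h: np.ndarray, probe: LogisticRegression, alpha: float = 2.0):
    """Push a hidden state along the probe's hyperplane normal (editing direction)."""
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
    return h + alpha * direction

edited = edit_toward(hidden[2][0], probes[2])  # perturb one sample at layer 2
```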

Authors:Soumyadeep Pal, Changsheng Wang, James Diffenderfer, Bhavya Kailkhura, Sijia Liu
Title: LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks
Abstract:
Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior by removing undesired data-model influences from the pretrained model while preserving general utility. Significant recent efforts have been dedicated to developing LLM unlearning benchmarks such as WMDP (Weapons of Mass Destruction Proxy) and MUSE (Machine Unlearning Six-way Evaluation), facilitating standardized unlearning performance assessment and method comparison. Despite their usefulness, we uncover for the first time a novel coreset effect within these benchmarks. Specifically, we find that LLM unlearning achieved with the original (full) forget set can be effectively maintained using a significantly smaller subset (functioning as a "coreset"), e.g., as little as 5% of the forget set, even when selected at random. This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime. We demonstrate that this coreset effect remains strong, regardless of the LLM unlearning method used, such as NPO (Negative Preference Optimization) and RMU (Representation Misdirection Unlearning), the popular ones in these benchmarks. The surprisingly strong coreset effect is also robust across various data selection methods, ranging from random selection to more sophisticated heuristic approaches. We explain the coreset effect in LLM unlearning through a keyword-based perspective, showing that keywords extracted from the forget set alone contribute significantly to unlearning effectiveness and indicating that current unlearning is driven by a compact set of high-impact tokens rather than the entire dataset. We further justify the faithfulness of coreset-unlearned models along additional dimensions, such as mode connectivity and robustness to jailbreaking attacks. Codes are available at https://github.com/OPTML-Group/MU-Coreset.
中文: 大语言模型遗忘可以通过仅使用遗忘集中极小部分(如5%)作为核心集有效实现,这归因于高影响力关键词而非整个数据集的作用,且该效应在不同方法和数据选择策略中均表现稳健。
English: Large language model unlearning can be effectively achieved using a surprisingly small subset of the forget set, known as a coreset, as minimal as 5%, due to the influence of high-impact keywords rather than the entire dataset, with this effect being robust across various methods and data selection approaches.
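
The experimental setup the paper probes is simple to state in code; the sketch below draws a random 5% coreset of the forget set, the regime in which the authors find unlearning is largely preserved.

```python
# Random coreset selection from a forget set.
import random

def random_coreset(forget_set: list, fraction: float = 0.05, seed: int = 0) -> list:
    """Sample a random coreset; the paper finds this preserves unlearning."""
    rng = random.Random(seed)
    k = max(1, int(len(forget_set) * fraction))
    return rng.sample(forget_set, k)

forget_set = [f"doc_{i}" for i in range(1000)]
print(len(random_coreset(forget_set)))  # 50 examples stand in for all 1000
```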

Authors:Zhaopeng Feng, Shaosheng Cao, Jiahan Ren, Jiayuan Su, Ruizhe Chen, Yan Zhang, Zhe Xu, Yao Hu, Jian Wu, Zuozhu Liu
Title: MT-R1-Zero: Advancing LLM-based Machine Translation via R1-Zero-like Reinforcement Learning
Abstract:
Large-scale reinforcement learning (RL) methods have proven highly effective in enhancing the reasoning abilities of large language models (LLMs), particularly for tasks with verifiable solutions such as mathematics and coding. However, applying this idea to machine translation (MT), where outputs are flexibly formatted and difficult to automatically evaluate with explicit rules, remains underexplored. In this work, we introduce MT-R1-Zero, the first open-source adaptation of the R1-Zero RL framework for MT without supervised fine-tuning or cold-start. We propose a rule-metric mixed reward mechanism to guide LLMs towards improved translation quality via emergent reasoning. On the WMT 24 English-Chinese benchmark, our MT-R1-Zero-3B-Mix achieves competitive performance, surpassing TowerInstruct-7B-v0.2 by an average of 1.26 points. Meanwhile, our MT-R1-Zero-7B-Mix attains a high average score of 62.25 across all metrics, placing it on par with advanced proprietary models such as GPT-4o and Claude-3.5-Sonnet, while the MT-R1-Zero-7B-Sem variant achieves state-of-the-art scores on semantic metrics. Moreover, our work exhibits strong generalization capabilities on out-of-distribution MT tasks, robustly supporting multilingual and low-resource settings. Extensive analysis of model behavior across different initializations and reward metrics offers pioneering insight into the critical role of reward design, LLM adaptability, training dynamics, and emergent reasoning patterns within the R1-Zero paradigm for MT. Our code is available at https://github.com/fzp0424/MT-R1-Zero.
中文摘要:MT-R1-Zero框架通过规则与指标混合奖励机制,成功将强化学习应用于机器翻译领域,在实现与先进模型相媲美性能的同时,展现出在多语言和低资源场景下的强大泛化能力。
English Summary: The MT-R1-Zero framework successfully adapts reinforcement learning to machine translation by using a rule-metric mixed reward mechanism, achieving competitive performance against advanced models while demonstrating strong generalization in multilingual and low-resource settings.
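
A hedged sketch of a rule-metric mixed reward: a format rule gates the output template and a learned metric scores the translation. The tag template, the stub metric, and the 0.5/0.5 weighting are assumptions for illustration, not the paper's exact recipe.

```python
# A mixed reward: rule-based format check blended with a metric score.
import re

def format_rule_reward(output: str) -> float:
    """1.0 only if the output follows an assumed <think>/<translate> template."""
    pattern = r"(?s)<think>.+?</think>\s*<translate>.+?</translate>"
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0

def metric_reward(hypothesis: str, source: str) -> float:
    """Placeholder for a learned MT metric (e.g. a COMET-style scorer)."""
    return 0.5  # a real system would score (source, hypothesis) here

def extract_translation(output: str) -> str:
    m = re.search(r"(?s)<translate>(.+?)</translate>", output)
    return m.group(1).strip() if m else output.strip()

def mixed_reward(output: str, source: str, w_rule: float = 0.5) -> float:
    hyp = extract_translation(output)
    return w_rule * format_rule_reward(output) + (1.0 - w_rule) * metric_reward(hyp, source)

out = "<think>gloss each word, then smooth</think><translate>你好,世界</translate>"
print(mixed_reward(out, "Hello, world"))  # 0.75 with the stub metric
```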

Authors:Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He
Title: Breaking the Data Barrier -- Building GUI Agents Through Task Generalization
Abstract:
Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.
中文: GUI代理面临高质量数据稀缺的挑战,通过在中期训练阶段让视觉语言模型学习多样化推理任务,可显著提升其在图形界面规划场景中的泛化能力和性能表现。
English: GUI agents face data scarcity issues, but training Vision Language Models on diverse reasoning tasks during mid-training significantly enhances their generalization and performance across GUI planning scenarios.

Authors:Nitya Thakkar, Mert Yuksekgonul, Jake Silberg, Animesh Garg, Nanyun Peng, Fei Sha, Rose Yu, Carl Vondrick, James Zou
Title: Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025
Abstract:
Peer review at AI conferences is stressed by rapidly rising submission volumes, leading to deteriorating review quality and increased author dissatisfaction. To address these issues, we developed Review Feedback Agent, a system leveraging multiple large language models (LLMs) to improve review clarity and actionability by providing automated feedback on vague comments, content misunderstandings, and unprofessional remarks to reviewers. Implemented at ICLR 2025 as a large randomized control study, our system provided optional feedback to more than 20,000 randomly selected reviews. To ensure high-quality feedback for reviewers at this scale, we also developed a suite of automated reliability tests powered by LLMs that acted as guardrails to ensure feedback quality, with feedback only being sent to reviewers if it passed all the tests. The results show that 27% of reviewers who received feedback updated their reviews, and over 12,000 feedback suggestions from the agent were incorporated by those reviewers. This suggests that many reviewers found the AI-generated feedback sufficiently helpful to merit updating their reviews. Incorporating AI feedback led to significantly longer reviews (an average increase of 80 words among those who updated after receiving feedback) and more informative reviews, as evaluated by blinded researchers. Moreover, reviewers who were selected to receive AI feedback were also more engaged during paper rebuttals, as seen in longer author-reviewer discussions. This work demonstrates that carefully designed LLM-generated review feedback can enhance peer review quality by making reviews more specific and actionable while increasing engagement between reviewers and authors. The Review Feedback Agent is publicly available at https://github.com/zou-group/review_feedback_agent.
中文: 该评审反馈代理系统利用大语言模型为同行评审提供自动反馈,通过在ICLR 2025的大规模实验证明,能显著提升评审质量、增加评审长度并促进审稿人参与度。
English: The Review Feedback Agent uses large language models to provide automated feedback on peer reviews, significantly improving review quality, length, and reviewer engagement as demonstrated in a large-scale ICLR 2025 study.

Authors:Zhenting Wang, Guofeng Cui, Yu-Jhe Li, Kun Wan, Wentian Zhao
Title: DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training
Abstract:
Recent advances in reinforcement learning (RL)-based post-training have led to notable improvements in large language models (LLMs), particularly in enhancing their reasoning capabilities to handle complex tasks. However, most existing methods treat the training data as a unified whole, overlooking the fact that modern LLM training often involves a mixture of data from diverse distributions, varying in both source and difficulty. This heterogeneity introduces a key challenge: how to adaptively schedule training across distributions to optimize learning efficiency. In this paper, we present a principled curriculum learning framework grounded in the notion of distribution-level learnability. Our core insight is that the magnitude of policy advantages reflects how much a model can still benefit from further training on a given distribution. Based on this, we propose a distribution-level curriculum learning framework for RL-based LLM post-training, which leverages the Upper Confidence Bound (UCB) principle to dynamically adjust sampling probabilities for different distributions. This approach prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration), yielding an adaptive and theoretically grounded training schedule. We instantiate our curriculum learning framework with GRPO as the underlying RL algorithm and demonstrate its effectiveness on logic reasoning datasets with multiple difficulties and sources. Our experiments show that our framework significantly improves convergence speed and final performance, highlighting the value of distribution-aware curriculum strategies in LLM post-training. Code: https://github.com/ZhentingWang/DUMP.
中文: 本文提出了一种课程学习框架,通过策略优势和置信上界原则自适应地调度不同数据分布的强化学习后训练,从而显著提升大语言模型的收敛速度与最终性能。
English: This paper introduces a curriculum learning framework that adaptively schedules training across diverse data distributions in reinforcement learning-based post-training of large language models, using policy advantages and the Upper Confidence Bound principle to enhance convergence speed and performance.
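
The UCB scheduling idea can be sketched compactly: each distribution is scored by its average policy advantage plus an exploration bonus that shrinks with its sample count. The exploration coefficient c is an illustrative choice.

```python
# UCB-style scoring over data distributions.
import math

def ucb_scores(avg_advantage: dict[str, float],
               counts: dict[str, int],
               c: float = 1.0) -> dict[str, float]:
    """Average advantage (exploitation) plus a count-based bonus (exploration)."""
    total = sum(counts.values()) + 1
    return {
        name: avg_advantage[name] + c * math.sqrt(math.log(total) / (counts[name] + 1))
        for name in avg_advantage
    }

scores = ucb_scores({"easy": 0.1, "hard": 0.6}, {"easy": 500, "hard": 20})
next_dist = max(scores, key=scores.get)  # sample the next training batch from here
print(next_dist)  # "hard": high advantage and still under-sampled
```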

Authors:Jixiao Zhang, Chunsheng Zuo
Title: GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models
Abstract:
Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem difficulty. We propose GRPO-LEAD, enhancing GRPO with: (1) length-regularized rewards to encourage conciseness while maintaining accuracy; (2) explicit penalties for incorrect solutions to improve model precision; and (3) difficulty-aware advantage reweighting for robust generalization on challenging problems. Comprehensive evaluations demonstrate that GRPO-LEAD significantly improves reasoning accuracy, conciseness, and efficiency. Our approach achieves state-of-the-art performance for 14B-scale models, underscoring the synergy of our methods with appropriate model scale and high-quality data. Our source code, generated dataset, and models are available at https://github.com/aeroplanepaper/GRPO-LEAD.
中文摘要:GRPO-LEAD通过引入长度正则化奖励、显式错误惩罚和难度感知优势重加权,显著提升了数学推理的准确性与简洁性,在140亿参数模型中实现了最优性能。
English Summary: GRPO-LEAD enhances mathematical reasoning by introducing length-regularized rewards, explicit error penalties, and difficulty-aware advantage reweighting, achieving state-of-the-art performance in accuracy and conciseness for 14B-scale models.
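
A minimal sketch of the three ingredients listed above; all coefficients and functional forms here are illustrative assumptions rather than the paper's formulas.

```python
# Length-regularized reward, explicit wrong-answer penalty, difficulty reweighting.
import math

def lead_reward(correct: bool, n_tokens: int, lam: float = 0.001,
                wrong_penalty: float = 1.0) -> float:
    if correct:
        return math.exp(-lam * n_tokens)  # correct but shorter scores higher
    return -wrong_penalty                 # explicit penalty for incorrect solutions

def difficulty_weight(group_accuracy: float) -> float:
    """Upweight advantages on hard problems (low accuracy within the sample group)."""
    return 1.0 + (1.0 - group_accuracy)

adv = lead_reward(True, 350) * difficulty_weight(group_accuracy=0.25)
print(round(adv, 3))
```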

Authors:Jiahao Qiu, Yinghui He, Xinzhe Juan, Yimin Wang, Yuhan Liu, Zixin Yao, Yue Wu, Xun Jiang, Ling Yang, Mengdi Wang
Title: EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
Abstract:
The rise of LLM-driven AI characters raises safety concerns, particularly for vulnerable human users with psychological disorders. To address these risks, we propose EmoAgent, a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions. EmoAgent comprises two components: EmoEval simulates virtual users, including those portraying mentally vulnerable individuals, to assess mental health changes before and after interactions with AI characters. It uses clinically proven psychological and psychiatric assessment tools (PHQ-9, PDI, PANSS) to evaluate mental risks induced by LLMs. EmoGuard serves as an intermediary, monitoring users' mental status, predicting potential harm, and providing corrective feedback to mitigate risks. Experiments conducted in popular character-based chatbots show that emotionally engaging dialogues can lead to psychological deterioration in vulnerable users, with mental state deterioration in more than 34.4% of the simulations. EmoGuard significantly reduces these deterioration rates, underscoring its role in ensuring safer AI-human interactions. Our code is available at: https://github.com/1akaman/EmoAgent
中文摘要:EmoAgent框架通过EmoEval组件评估AI交互导致的心理健康风险,并利用EmoGuard实时监测干预,实验证明能显著降低脆弱用户群体34.4%以上的心理状态恶化率。
English Summary: The EmoAgent framework addresses mental health risks in human-AI interactions by using EmoEval to assess psychological deterioration through clinical tools and EmoGuard to monitor and mitigate harm, significantly reducing deterioration rates in vulnerable users.

Authors:Zhehao Dong, Zhen Lu, Yue Yang
Title: Fine-tuning a Large Language Model for Automating Computational Fluid Dynamics Simulations
Abstract:
Configuring computational fluid dynamics (CFD) simulations typically demands extensive domain expertise, limiting broader access. Although large language models (LLMs) have advanced scientific computing, their use in automating CFD workflows is underdeveloped. We introduce a novel approach centered on domain-specific LLM adaptation. By fine-tuning Qwen2.5-7B-Instruct on NL2FOAM, our custom dataset of 28716 natural language-to-OpenFOAM configuration pairs with chain-of-thought (CoT) annotations, we enable direct translation from natural language descriptions to executable CFD setups. A multi-agent framework orchestrates the process, autonomously verifying inputs, generating configurations, running simulations, and correcting errors. Evaluation on a benchmark of 21 diverse flow cases demonstrates state-of-the-art performance, achieving 88.7% solution accuracy and 82.6% first-attempt success rate. This significantly outperforms larger general-purpose models like Qwen2.5-72B-Instruct, DeepSeek-R1, and Llama3.3-70B-Instruct, while also requiring fewer correction iterations and maintaining high computational efficiency. The results highlight the critical role of domain-specific adaptation in deploying LLM assistants for complex engineering workflows. Our code and fine-tuned model have been deposited at https://github.com/YYgroup/AutoCFD.
中文: 本研究通过领域特定的微调大语言模型方法,实现了从自然语言到CFD仿真的自动化配置,在显著超越通用大模型的同时保持了高精度与高效率。
English: This study introduces a domain-specific fine-tuned LLM approach that automates CFD simulation setup through natural language translation, achieving state-of-the-art accuracy and efficiency while outperforming larger general-purpose models.

Authors:Chenghao Li, Chaoning Zhang, Yi Lu, Jiaquan Zhang, Qigan Sun, Xudong Wang, Jiwei Wei, Guoqing Wang, Yang Yang, Heng Tao Shen
Title: Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution
Abstract:
Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps, mimicking human logic and reducing errors. However, complex tasks with vast solution spaces and vague constraints often exceed the capacity of a single reasoning chain. Inspired by Minimal Free Resolution (MFR) in commutative algebra and algebraic geometry, we propose Syzygy of Thoughts (SoT), a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths. SoT captures deeper logical dependencies, enabling more robust and structured problem-solving. MFR decomposes a module into a sequence of free modules with minimal rank, providing a structured analytical approach to complex systems. This method introduces the concepts of "Module", "Betti numbers", "Freeness", "Mapping", "Exactness" and "Minimality", enabling the systematic decomposition of the original complex problem into logically complete minimal subproblems while preserving key problem features and reducing reasoning length. We tested SoT across diverse datasets (e.g., GSM8K, MATH) and models (e.g., GPT-4o-mini, Qwen2.5), achieving inference accuracy that matches or surpasses mainstream CoT standards. Additionally, by aligning the sampling process with algebraic constraints, our approach enhances the scalability of inference time in LLMs, ensuring both transparent reasoning and high performance. Our code will be publicly available at https://github.com/dlMARiA/Syzygy-of-thoughts.
Chinese: 受极小自由分解启发,Syzygy of Thoughts (SoT) 通过引入相互关联的推理路径扩展了思维链提示,在多个数据集和模型上实现了更稳健的问题解决能力和更高的推理准确率。
English: Syzygy of Thoughts (SoT) extends Chain-of-Thought prompting by introducing interrelated reasoning paths inspired by Minimal Free Resolution, enabling more robust problem-solving and improved inference accuracy across diverse datasets and models.

Authors:Chen Sun, Renat Aksitov, Andrey Zhmoginov, Nolan Andrew Miller, Max Vladymyrov, Ulrich Rueckert, Been Kim, Mark Sandler
Title: How new data permeates LLM knowledge and how to dilute it
Abstract:
Large language models learn and continually learn through the accumulation of gradient-based updates, but how individual pieces of new information affect existing knowledge, leading to both beneficial generalization and problematic hallucination, remains poorly understood. We demonstrate that when learning new information, LLMs exhibit a "priming" effect: learning a new fact can cause the model to inappropriately apply that knowledge in unrelated contexts. To systematically study this phenomenon, we introduce "Outlandish," a carefully curated dataset of 1320 diverse text samples designed to probe how new knowledge permeates through an LLM's existing knowledge base. Using this dataset, we show that the degree of priming after learning new information can be predicted by measuring the token probability of key words before learning. This relationship holds robustly across different model architectures (PALM-2, Gemma, Llama), sizes, and training stages. Finally, we develop two novel techniques to modulate how new knowledge affects existing model behavior: (1) a ``stepping-stone'' text augmentation strategy and (2) an ``ignore-k'' update pruning method. These approaches reduce undesirable priming effects by 50-95\% while preserving the model's ability to learn new information. Our findings provide both empirical insights into how LLMs learn and practical tools for improving the specificity of knowledge insertion in language models. Further materials: https://sunchipsster1.github.io/projects/outlandish/
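Chinese: 本研究发现大语言模型学习新信息时会产生"启动"效应,即将新知识错误地应用于无关语境;该效应可通过学习前关键词的标记概率预测,作者提出的两种技术可将不良启动效应降低50-95%。
English: This study reveals a "priming" effect where LLMs inappropriately apply newly learned facts in unrelated contexts, shows the effect is predictable from keyword token probabilities measured before learning, and introduces two techniques that reduce undesirable priming by 50-95% while preserving the model's ability to learn.

The abstract's predictor (the token probability of key words before learning) is easy to sketch. The model choice (gpt2 as a small stand-in for the PALM-2/Gemma/Llama models studied) and the single-token treatment of the keyword are illustrative assumptions.

```python
# Measure a keyword's token probability under the model *before* the update.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def keyword_probability(context: str, keyword: str) -> float:
    """P(first keyword token | context) under the pre-update model."""
    ids = tok(context, return_tensors="pt").input_ids
    kw_id = tok(" " + keyword, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[kw_id].item()

print(keyword_probability("The color of the banana is", "purple"))
```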

Authors:Sharanya Dasgupta, Sujoy Nath, Arkaprabha Basu, Pourya Shamsolmoali, Swagatam Das
Title: HalluShift: Measuring Distribution Shifts towards Hallucination Detection in LLMs
Abstract:
Large Language Models (LLMs) have recently garnered widespread attention due to their adeptness at generating innovative responses to the given prompts across a multitude of domains. However, LLMs often suffer from the inherent limitation of hallucinations and generate incorrect information while maintaining well-structured and coherent responses. In this work, we hypothesize that hallucinations stem from the internal dynamics of LLMs. Our observations indicate that, during passage generation, LLMs tend to deviate from factual accuracy in subtle parts of responses, eventually shifting toward misinformation. This phenomenon bears a resemblance to human cognition, where individuals may hallucinate while maintaining logical coherence, embedding uncertainty within minor segments of their speech. To investigate this further, we introduce an innovative approach, HalluShift, designed to analyze the distribution shifts in the internal state space and token probabilities of the LLM-generated responses. Our method attains superior performance compared to existing baselines across various benchmark datasets. Our codebase is available at https://github.com/sharanya-dasgupta001/hallushift.
Chinese: 本研究提出HalluShift方法,通过分析大语言模型内部状态空间和标记概率的分布偏移来检测其产生的幻觉,在多个基准测试中表现优于现有基线。
English: This study introduces HalluShift, a novel method that detects hallucinations in Large Language Models by analyzing shifts in their internal state space and token probabilities, achieving superior performance across multiple benchmarks.
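
As a rough illustration, a HalluShift-style detector needs per-token features of how confidently a passage was generated; the specific feature set below is an assumption for illustration, not the paper's.

```python
# Summary features over the probabilities of the tokens actually generated.
import math

def token_prob_features(token_probs: list[float]) -> dict[str, float]:
    """Summarize generation confidence; dips may mark drift toward misinformation."""
    n = len(token_probs)
    return {
        "mean_prob": sum(token_probs) / n,
        "min_prob": min(token_probs),                            # sharp dips
        "mean_neg_logprob": -sum(math.log(p) for p in token_probs) / n,
        "frac_low_conf": sum(p < 0.3 for p in token_probs) / n,  # 0.3: illustrative cutoff
    }

print(token_prob_features([0.9, 0.85, 0.2, 0.95]))
```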

Authors:Wuyang Lan, Wenzheng Wang, Changwei Ji, Guoxing Yang, Yongbo Zhang, Xiaohong Liu, Song Wu, Guangyu Wang
Title: ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model
Abstract:
Recent advances in reasoning with large language models (LLMs) have shown remarkable reasoning capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves comparable performance to GPT-4 in English settings. This comparative study effectively validates the superior performance of ClinicalGPT-R1 in disease diagnosis tasks. Resources are available at https://github.com/medfound/medfound.
Chinese: ClinicalGPT-R1作为基于2万份临床记录训练的推理增强大语言模型,在中文诊断任务中超越GPT-4o,英文场景与GPT-4表现相当,其卓越诊断能力已通过MedBench-Hard基准验证。
English: ClinicalGPT-R1, a reasoning-enhanced LLM trained on 20,000 clinical records, surpasses GPT-4o in Chinese diagnostic tasks and matches GPT-4 in English, as validated on the challenging MedBench-Hard dataset.

Authors:Matt Grenander, Siddharth Varia, Paula Czarnowska, Yogarshi Vyas, Kishaloy Halder, Bonan Min
Title: Exploration of Plan-Guided Summarization for Narrative Texts: the Case of Small Language Models
Abstract:
Plan-guided summarization attempts to reduce hallucinations in small language models (SLMs) by grounding generated summaries to the source text, typically by targeting fine-grained details such as dates or named entities. In this work, we investigate whether plan-based approaches in SLMs improve summarization in long document, narrative tasks. Narrative texts' length and complexity often mean they are difficult to summarize faithfully. We analyze existing plan-guided solutions targeting fine-grained details, and also propose our own higher-level, narrative-based plan formulation. Our results show that neither approach significantly improves on a baseline without planning in either summary quality or faithfulness. Human evaluation reveals that while plan-guided approaches are often well grounded to their plan, plans are equally likely to contain hallucinations compared to summaries. As a result, the plan-guided summaries are just as unfaithful as those from models without planning. Our work serves as a cautionary tale to plan-guided approaches to summarization, especially for long, complex domains such as narrative texts. Code available at https://github.com/amazon-science/plan-guided-summarization
Chinese: 计划引导的摘要方法在小型语言模型中未能显著提升长篇叙事文本摘要的质量或忠实度,因为计划本身同样容易出现虚构内容,导致该方法效果不佳。
English: Plan-guided summarization in small language models does not significantly enhance the quality or faithfulness of summaries for long narrative texts, as plans themselves are prone to hallucinations, rendering the approach ineffective.

Authors:Adrianna Romanowski, Pedro H. V. Valois, Kazuhiro Fukui
Title: From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy
Abstract:
Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlate with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models to accurately identify humorous quotes from a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset to improve the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model's performance. The model's results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at https://github.com/swaggirl9000/humor.
中文: 本研究提出了一种新颖的幽默检测指标,用于评估大语言模型从单口喜剧文本中识别幽默笑点的能力,结果显示顶尖模型最高达到51%的准确率——超过人类评估者的41%——同时揭示了幽默提取的主观性与复杂性。
English: This study introduces a novel humor detection metric to evaluate large language models' ability to identify humorous punchlines from stand-up comedy transcripts, revealing that top models achieve up to 51% accuracy—surpassing human evaluators' 41%—while highlighting the subjectivity and complexity of humor extraction.
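
The fuzzy-string-matching mode of the metric can be sketched with the standard library; difflib stands in for whichever matcher the authors use, and the 0.8 cutoff is illustrative.

```python
# Fuzzy matching of extracted quotes against gold punchlines.
from difflib import SequenceMatcher

def fuzzy_hit(predicted: str, gold: str, cutoff: float = 0.8) -> bool:
    ratio = SequenceMatcher(None, predicted.lower(), gold.lower()).ratio()
    return ratio >= cutoff

def fuzzy_score(predictions: list[str], golds: list[str]) -> float:
    """Fraction of gold punchlines matched by at least one prediction."""
    hits = sum(any(fuzzy_hit(p, g) for p in predictions) for g in golds)
    return hits / len(golds)

golds = ["I told my therapist about my fear of speed bumps. She said I'd get over it."]
preds = ["I told my therapist about my fear of speed bumps. She said I would get over it."]
print(fuzzy_score(preds, golds))  # 1.0: near-verbatim extraction counts as a hit
```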

Authors:Zhengke Sun, Hangwei Qian, Ivor Tsang
Title: Exploring the Effectiveness and Interpretability of Texts in LLM-based Time Series Models
Abstract:
Large Language Models (LLMs) have been applied to time series forecasting tasks, leveraging pre-trained language models as the backbone and incorporating textual data to purportedly enhance the comprehensive capabilities of LLMs for time series. However, are these texts really helpful for interpretation? This study seeks to investigate the actual efficacy and interpretability of such textual incorporations. Through a series of empirical experiments on textual prompts and textual prototypes, our findings reveal that the misalignment between two modalities exists, and the textual information does not significantly improve time series forecasting performance in many cases. Furthermore, visualization analysis indicates that the textual representations learned by existing frameworks lack sufficient interpretability when applied to time series data. We further propose a novel metric named Semantic Matching Index (SMI) to better evaluate the matching degree between time series and texts during our post hoc interpretability investigation. Our analysis reveals the misalignment and limited interpretability of texts in current time-series LLMs, and we hope this study can raise awareness of the interpretability of texts for time series. The code is available at https://github.com/zachysun/TS-Lang-Exp.
中文摘要:本研究质疑文本数据在时间序列大语言模型中的有效性,发现由于模态不匹配,文本信息通常无法提升预测性能且缺乏可解释性。
English Summary: This study questions the effectiveness of text integration in time series forecasting with LLMs, finding that textual data often fails to improve performance or provide clear interpretability due to modality misalignment.

Authors:Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Qiushi Sun, Kanzhi Cheng, Junxian He, Jun Liu, Zhiyong Wu
Title: Genius: A Generalizable and Purely Unsupervised Self-Training Framework For Advanced Reasoning
Abstract:
Advancing LLM reasoning skills has captivated wide interest. However, current post-training techniques rely heavily on supervisory signals, such as outcome supervision or auxiliary reward models, which face the problem of scalability and high annotation costs. This motivates us to enhance LLM reasoning without the need for external supervision. We introduce a generalizable and purely unsupervised self-training framework, named Genius. Without external auxiliary signals, Genius seeks the optimal response sequence in a stepwise manner and optimizes the LLM. To explore the potential steps and exploit the optimal ones, Genius introduces a stepwise foresight re-sampling strategy to sample and estimate the step value by simulating future outcomes. Further, we recognize that the unsupervised setting inevitably induces intrinsic noise and uncertainty. To provide a robust optimization, we propose an advantage-calibrated optimization (ACO) loss function to mitigate estimation inconsistencies. Combining these techniques together, Genius provides an advanced initial step towards self-improving LLM reasoning with general queries and without supervision, revolutionizing reasoning scaling laws given the vast availability of general queries. The code will be released at https://github.com/xufangzhi/Genius.
中文总结:Genius是一种无监督自训练框架,通过逐步前瞻重采样和优势校准优化技术,无需外部监督即可提升大语言模型的推理能力,有效解决了扩展性和标注成本问题。
English Summary: Genius is an unsupervised self-training framework that enhances LLM reasoning through stepwise foresight re-sampling and advantage-calibrated optimization, eliminating the need for external supervision while improving scalability.
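
A minimal sketch of stepwise foresight re-sampling under stated assumptions: sample candidate next steps, estimate each step's value by simulated rollouts, and keep the best. The sample_step and rollout_score stubs are hypothetical placeholders for LLM calls.

```python
# Foresight re-sampling: explore candidate steps, exploit the highest-value one.
import random

random.seed(0)

def sample_step(context: str) -> str:
    """Placeholder for sampling one candidate reasoning step from the LLM."""
    return f"step<{random.randint(0, 99)}>"

def rollout_score(context: str, step: str) -> float:
    """Placeholder: simulate a future continuation and score the outcome."""
    return random.random()

def foresight_select(context: str, k: int = 4, rollouts: int = 3) -> str:
    candidates = [sample_step(context) for _ in range(k)]
    values = {
        c: sum(rollout_score(context + c, c) for _ in range(rollouts)) / rollouts
        for c in candidates
    }
    return max(values, key=values.get)  # keep the step with the best foresight value

print(foresight_select("Q: 17 * 24 = ?"))
```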

Authors:Nicola Horst, Davide Mazzaccara, Antonia Schmidt, Michael Sullivan, Filippo Momentè, Luca Franceschetti, Philipp Sadler, Sherzod Hakimov, Alberto Testoni, Raffaella Bernardi, Raquel Fernández, Alexander Koller, Oliver Lemon, David Schlangen, Mario Giulianelli, Alessandro Suglia
Title: Playpen: An Environment for Exploring Learning Through Conversational Interaction
Abstract:
Interaction between learner and feedback-giver has come into focus recently for post-training of Large Language Models (LLMs), through the use of reward models that judge the appropriateness of a model's response. In this paper, we investigate whether Dialogue Games -- goal-directed and rule-governed activities driven predominantly by verbal actions -- can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play, and investigate a representative set of post-training methods: supervised fine-tuning; direct alignment (DPO); and reinforcement learning with GRPO. We experiment with post-training a small LLM (Llama-3.1-8B-Instruct), evaluating performance on unseen instances of training games as well as unseen games, and on standard benchmarks. We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills, while interactive learning with GRPO shows balanced improvements without loss of skills. We release the framework and the baseline training setups to foster research in the promising new direction of learning in (synthetic) interaction.
中文: 本研究探讨将对话游戏作为大语言模型后训练的反馈来源,通过自博弈学习环境Playpen发现交互式强化学习(GRPO)能在保持各项技能的同时实现均衡的性能提升。
English: This study explores using Dialogue Games as a feedback source for post-training LLMs, introducing Playpen for self-play learning and finding that interactive reinforcement learning (GRPO) achieves balanced skill improvements without degradation.

Authors:Ye Ye
Title: Task Memory Engine (TME): Enhancing State Awareness for Multi-Step LLM Agent Tasks
Abstract:
Large Language Models (LLMs) are increasingly used as autonomous agents for multi-step tasks. However, most existing frameworks fail to maintain a structured understanding of the task state, often relying on linear prompt concatenation or shallow memory buffers. This leads to brittle performance, frequent hallucinations, and poor long-range coherence. In this work, we propose the Task Memory Engine (TME), a lightweight and structured memory module that tracks task execution using a hierarchical Task Memory Tree (TMT). Each node in the tree corresponds to a task step, storing relevant input, output, status, and sub-task relationships. We introduce a prompt synthesis method that dynamically generates LLM prompts based on the active node path, significantly improving execution consistency and contextual grounding. Through case studies and comparative experiments on multi-step agent tasks, we demonstrate that TME leads to better task completion accuracy and more interpretable behavior with minimal implementation overhead. A reference implementation of the core TME components is available at https://github.com/biubiutomato/TME-Agent, including basic examples and structured memory integration. While the current implementation uses a tree-based structure, TME is designed to be graph-aware, supporting reusable substeps, converging task paths, and shared dependencies. This lays the groundwork for future DAG-based memory architectures.
Chinese: 本文提出任务记忆引擎(TME),通过分层任务记忆树结构追踪多步骤任务执行状态并动态生成提示,以最小实现成本显著提升大语言模型代理的任务完成准确性和可解释性。
English: This paper introduces the Task Memory Engine (TME), a structured memory module that uses a hierarchical tree to track multi-step task execution and dynamically generate prompts, improving LLM agent performance with minimal overhead.
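
A minimal sketch of a Task Memory Tree and path-based prompt synthesis; field names and the prompt format are illustrative, not the reference implementation's API.

```python
# A tree of task steps, with prompts built from the root-to-active-node path.
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    description: str
    status: str = "pending"  # pending | done | failed
    output: str = ""
    children: list["TaskNode"] = field(default_factory=list)
    parent: "TaskNode | None" = None

    def add_subtask(self, description: str) -> "TaskNode":
        child = TaskNode(description, parent=self)
        self.children.append(child)
        return child

def synthesize_prompt(active: TaskNode) -> str:
    """Build the LLM prompt from the active node's ancestor path only."""
    path, node = [], active
    while node is not None:
        path.append(node)
        node = node.parent
    lines = [f"[{n.status}] {n.description}: {n.output}".strip() for n in reversed(path)]
    return "Task context:\n" + "\n".join(lines) + f"\nNow execute: {active.description}"

root = TaskNode("Plan a trip", status="done", output="3-day itinerary drafted")
book = root.add_subtask("Book flights")
print(synthesize_prompt(book))
```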

Authors:Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger
Title: DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?
Abstract:
Reasoning-enabled large language models (LLMs) excel in logical tasks, yet their utility for evaluating natural language generation remains unexplored. This study systematically compares reasoning LLMs with non-reasoning counterparts across machine translation and text summarization evaluation tasks. We evaluate eight models spanning state-of-the-art reasoning models (DeepSeek-R1, OpenAI o3), their distilled variants (8B-70B parameters), and equivalent non-reasoning LLMs. Experiments on WMT23 and SummEval benchmarks reveal architecture and task-dependent benefits: OpenAI o3-mini models show improved performance with increased reasoning on MT, while DeepSeek-R1 generally underperforms compared to its non-reasoning variant except in summarization consistency evaluation. Correlation analysis demonstrates that reasoning token usage correlates with evaluation quality only in specific models, while almost all models generally allocate more reasoning tokens when identifying more quality issues. Distillation maintains reasonable performance up to 32B parameter models but degrades substantially at 8B scale. This work provides the first assessment of reasoning LLMs for NLG evaluation and comparison to non-reasoning models. We share our code to facilitate further research: https://github.com/NL2G/reasoning-eval.
中文: 本研究首次评估了具备推理能力的大语言模型在自然语言生成任务中的表现,发现其相对于非推理模型的优势因架构和任务而异,同时证明模型蒸馏在32B参数规模内仍能保持良好性能。
English: This study pioneers the evaluation of reasoning-enabled large language models for assessing natural language generation tasks, revealing that their performance advantages over non-reasoning models vary by architecture and task while demonstrating that model distillation remains effective down to 32B parameters.

Authors:Ingryd V. S. T. Pereira, George D. C. Cavalcanti, Rafael M. O. Cruz
Title: Multi-view autoencoders for Fake News Detection
Abstract:
Given the volume and speed at which fake news spreads across social media, automatic fake news detection has become a highly important task. However, this task presents several challenges, including extracting textual features that contain relevant information about fake news. Research about fake news detection shows that no single feature extraction technique consistently outperforms the others across all scenarios. Nevertheless, different feature extraction techniques can provide complementary information about the textual data and enable a more comprehensive representation of the content. This paper proposes using multi-view autoencoders to generate a joint feature representation for fake news detection by integrating several feature extraction techniques commonly used in the literature. Experiments on fake news datasets show a significant improvement in classification performance compared to individual views (feature representations). We also observed that selecting a subset of the views instead of composing a latent space with all the views can be advantageous in terms of accuracy and computational effort. For further details, including source codes, figures, and datasets, please refer to the project's repository: https://github.com/ingrydpereira/multiview-fake-news.
Chinese: 本文提出了一种多视图自编码器方法,通过整合多种特征提取技术来提升虚假新闻检测效果,实验表明选择性融合互补文本特征可显著提高分类性能并优化计算效率。
English: This paper proposes a multi-view autoencoder approach that integrates multiple feature extraction techniques to enhance fake news detection, achieving improved classification performance and efficiency by selectively combining complementary textual features.
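
A minimal PyTorch sketch of the joint-representation idea: one encoder per textual view, a fused latent space, and per-view decoders for reconstruction. Dimensions and layer sizes are illustrative; the paper's architecture may differ.

```python
# Multi-view autoencoder: per-view encoders, a shared latent, per-view decoders.
import torch
import torch.nn as nn

class MultiViewAutoencoder(nn.Module):
    def __init__(self, view_dims: list[int], latent_dim: int = 64):
        super().__init__()
        self.encoders = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, latent_dim))
            for d in view_dims
        )
        self.fuse = nn.Linear(latent_dim * len(view_dims), latent_dim)
        self.decoders = nn.ModuleList(
            nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, d))
            for d in view_dims
        )

    def forward(self, views: list[torch.Tensor]):
        codes = [enc(v) for enc, v in zip(self.encoders, views)]
        joint = self.fuse(torch.cat(codes, dim=-1))     # joint feature representation
        recons = [dec(joint) for dec in self.decoders]  # reconstruct each view
        return joint, recons

views = [torch.randn(8, 300), torch.randn(8, 768)]  # e.g. TF-IDF + sentence embeddings
model = MultiViewAutoencoder([300, 768])
joint, recons = model(views)
print(joint.shape)  # torch.Size([8, 64]): features for a downstream fake-news classifier
```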

Authors:Yutaro Yamada, Robert Tjarko Lange, Cong Lu, Shengran Hu, Chris Lu, Jakob Foerster, Jeff Clune, David Ha
Title: The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search
Abstract:
AI is increasingly playing a pivotal role in transforming how scientific discoveries are made. We introduce The AI Scientist-v2, an end-to-end agentic system capable of producing the first entirely AI-generated peer-review-accepted workshop paper. This system iteratively formulates scientific hypotheses, designs and executes experiments, analyzes and visualizes data, and autonomously authors scientific manuscripts. Compared to its predecessor (v1, Lu et al., 2024 arXiv:2408.06292), The AI Scientist-v2 eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent. Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures. We evaluated The AI Scientist-v2 by submitting three fully autonomous manuscripts to a peer-reviewed ICLR workshop. Notably, one manuscript achieved high enough scores to exceed the average human acceptance threshold, marking the first instance of a fully AI-generated paper successfully navigating peer review. This accomplishment highlights the growing capability of AI in conducting all aspects of scientific research. We anticipate that further advancements in autonomous scientific discovery technologies will profoundly impact human knowledge generation, enabling unprecedented scalability in research productivity and significantly accelerating scientific breakthroughs, greatly benefiting society at large. We have open-sourced the code at https://github.com/SakanaAI/AI-Scientist-v2 to foster the future development of this transformative technology. We also discuss the role of AI in science, including AI safety.
中文: AI科学家-v2系统首次实现了完全由人工智能生成且通过同行评审的学术论文,标志着人工智能已具备自主开展完整科学研究的能力。
English: The AI Scientist-v2 is an autonomous system that successfully produced the first fully AI-generated peer-review-accepted scientific paper, demonstrating AI's growing capability to conduct end-to-end research without human intervention.

Authors:Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero
Title: Linguistic Interpretability of Transformer-based Language Models: a systematic review
Abstract:
Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little is known about how their internal computations help them achieve their results. This renders these models, as of today, 'black box' systems. There is, however, a line of research -- 'interpretability' -- aiming to learn how information is encoded inside these models. More specifically, there is work dedicated to studying whether Transformer-based models possess knowledge of linguistic phenomena similar to human speakers -- an area we call 'linguistic interpretability' of these models. In this survey we present a comprehensive analysis of 160 research works, spread across multiple languages and models -- including multilingual ones -- that attempt to discover linguistic information from the perspective of several traditional Linguistics disciplines: Syntax, Morphology, Lexico-Semantics and Discourse. Our survey fills a gap in the existing interpretability literature, which either does not focus on linguistic knowledge in these models or presents some limitations -- e.g. studying only English-based models. Our survey also focuses on Pre-trained Language Models not further specialized for a downstream task, with an emphasis on works that use interpretability techniques that explore models' internal representations.
中文: 本综述系统分析了160项研究,通过考察多语言Transformer模型在句法、形态、词汇语义及语篇层面的内部表征,填补了现有可解释性研究聚焦英语模型或忽视语言知识的空白。
English: This survey comprehensively analyzes 160 studies exploring how Transformer-based language models encode linguistic knowledge across syntax, morphology, semantics, and discourse, addressing gaps in interpretability research by examining multilingual models beyond English-specific limitations.

Authors:Biplav Srivastava, Kausik Lakkaraju, Nitin Gupta, Vansh Nagpal, Bharath C. Muppasani, Sara E. Jones
Title: SafeChat: A Framework for Building Trustworthy Collaborative Assistants and a Case Study of its Usefulness
Abstract:
Collaborative assistants, or chatbots, are data-driven decision support systems that enable natural interaction for task completion. While they can meet critical needs in modern society, concerns about their reliability and trustworthiness persist. In particular, Large Language Model (LLM)-based chatbots like ChatGPT, Gemini, and DeepSeek are becoming more accessible. However, such chatbots have limitations, including their inability to explain response generation, the risk of generating problematic content, the lack of standardized testing for reliability, and the need for deep AI expertise and extended development times. These issues make chatbots unsuitable for trust-sensitive applications like elections or healthcare. To address these concerns, we introduce SafeChat, a general architecture for building safe and trustworthy chatbots, with a focus on information retrieval use cases. Key features of SafeChat include: (a) safety, with a domain-agnostic design where responses are grounded and traceable to approved sources (provenance), and 'do-not-respond' strategies to prevent harmful answers; (b) usability, with automatic extractive summarization of long responses, traceable to their sources, and automated trust assessments to communicate expected chatbot behavior, such as sentiment; and (c) fast, scalable development, including a CSV-driven workflow, automated testing, and integration with various devices. We implemented SafeChat in an executable framework using the open-source chatbot platform Rasa. A case study demonstrates its application in building ElectionBot-SC, a chatbot designed to safely disseminate official election information. SafeChat is being used in many domains, validating its potential, and is available at: https://github.com/ai4society/trustworthy-chatbot.
中文: 协作式聊天机器人存在可靠性问题,因此推出了SafeChat架构,通过可追溯来源的响应和自动信任评估,确保在选举和医疗等敏感领域的安全可信应用。
English: Collaborative chatbots like ChatGPT face trust issues due to limitations in explainability and safety, prompting the development of SafeChat—a secure architecture ensuring traceable, source-grounded responses for reliable applications such as elections and healthcare.

Authors:Runjin Chen, Zhenyu Zhang, Junyuan Hong, Souvik Kundu, Zhangyang Wang
Title: SEAL: Steerable Reasoning Calibration of Large Language Models for Free
Abstract:
Large Language Models (LLMs), such as OpenAI's o1-series, have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. However, recent studies reveal substantial redundancy in the CoT reasoning traces, which not only increases inference latency but also negatively impacts model performance by diverting attention to unnecessary reasoning paths. To address this issue, we investigate the internal reasoning structures of LLMs and categorize them into three primary thought types: execution, reflection, and transition thoughts. Moreover, our analysis reveals that excessive reflection and transition thoughts are strongly correlated with failure cases and these thought categories exhibit clear separation in the latent space. Based on these findings, we introduce SEAL (Steerable reasoning calibration), a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains. SEAL consists of an offline stage for extracting the reasoning steering vector in the latent space, followed by an on-the-fly calibration of the reasoning trace through representation intervention using the steering vector. Notably, the steering vector exhibits strong transferability across various tasks. Extensive experiments across multiple models (DeepSeek-R1-Distill and QwQ-32B-Preview) and benchmarks (Math500, GSM8K, LiveCodeBench) validate the effectiveness of SEAL, with up to an 11% improvement in accuracy while reducing reasoning tokens by 11.8% to 50.4%. Our code is publicly available at https://github.com/VITA-Group/SEAL.
中文: SEAL是一种无需训练的方法,通过潜在空间导向向量校准大语言模型的思维链推理过程,在将推理标记减少11.8%至50.4%的同时,实现了最高11%的准确率提升。
English: SEAL is a training-free method that enhances the accuracy and efficiency of large language models by calibrating their chain-of-thought reasoning through latent space steering vectors, achieving up to 11% higher accuracy while reducing reasoning tokens by 11.8% to 50.4%.
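
As a rough illustration of SEAL's two stages, the sketch below contrasts mean latent representations of execution thoughts against reflection/transition thoughts to obtain a steering vector, then applies it on the fly via a forward hook. The toy linear layer stands in for a decoder block and the random tensors for cached thought representations; this is a reading of the abstract, not the released code.

```python
import torch
import torch.nn as nn

def steering_vector(exec_hiddens, reflect_hiddens):
    # Offline stage (sketch): contrast mean latent representations of
    # execution thoughts against reflection/transition thoughts.
    return exec_hiddens.mean(dim=0) - reflect_hiddens.mean(dim=0)

def install_steering(layer, vec, alpha=1.0):
    # On-the-fly stage (sketch): shift the layer's output along the
    # steering vector during generation.
    def hook(module, inputs, output):
        return output + alpha * vec.to(output.dtype)
    return layer.register_forward_hook(hook)

hidden_dim = 16
vec = steering_vector(torch.randn(32, hidden_dim), torch.randn(32, hidden_dim))
layer = nn.Linear(hidden_dim, hidden_dim)  # stand-in for a decoder block
handle = install_steering(layer, vec, alpha=0.5)
out = layer(torch.randn(2, hidden_dim))    # output is now calibrated
handle.remove()
```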

Authors:En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Jingyu Wang, Wenbing Tao
Title: Perception-R1: Pioneering Perception Policy with Reinforcement Learning
Abstract:
Inspired by the success of DeepSeek-R1, we explore the potential of rule-based reinforcement learning (RL) in MLLM post-training for perception policy learning. While promising, our initial experiments reveal that incorporating a thinking process through RL does not consistently lead to performance gains across all visual perception tasks. This leads us to delve into the essential role of RL in the context of visual perception. In this work, we return to the fundamentals and explore the effects of RL on different perception tasks. We observe that the perceptual complexity is a major factor in determining the effectiveness of RL. We also observe that reward design plays a crucial role in further approaching the upper limit of model perception. To leverage these findings, we propose Perception-R1, a scalable RL framework using GRPO during MLLM post-training. With a standard Qwen2.5-VL-3B-Instruct, Perception-R1 achieves +4.2% on RefCOCO+, +17.9% on PixMo-Count, +4.2% on PageOCR, and notably, 31.9% AP on COCO2017 val for the first time, establishing a strong baseline for perception policy learning.
中文: 本研究提出Perception-R1强化学习框架,通过优化感知复杂度处理与奖励机制设计,在多类视觉感知任务中实现显著性能提升,为感知策略学习建立了新基准。
English: This study introduces Perception-R1, a scalable reinforcement learning framework that enhances visual perception tasks by addressing perceptual complexity and reward design, achieving significant performance improvements across multiple benchmarks.

Authors:Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou
Title: Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
Abstract:
Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: Each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement in GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
中文: 动态备忘单(DC)是一种轻量级框架,为黑盒语言模型赋予持久记忆能力,使其能在推理时存储和复用解题策略,无需真实标签或人工反馈即可显著提升各类任务的表现。
English: Dynamic Cheatsheet (DC) is a lightweight framework that equips black-box language models with persistent memory, enabling them to store and reuse problem-solving insights at inference time, which substantially enhances performance across various tasks without requiring ground-truth labels or human feedback.
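
The test-time loop behind DC can be sketched in a few lines. In the hypothetical rendering below, `llm` is any black-box prompt-to-text callable and `curate` is a second call that distills concise, reusable snippets into the memory; the actual DC prompts and memory format differ.

```python
def solve_with_cheatsheet(llm, tasks, curate):
    """Test-time learning loop (sketch): carry an evolving memory across
    otherwise independent queries. `llm` and `curate` are hypothetical
    prompt-to-text callables."""
    memory = ""  # persistent, self-curated cheatsheet
    answers = []
    for task in tasks:
        prompt = ("Cheatsheet of strategies and code from earlier problems:\n"
                  f"{memory}\n\nSolve the new problem:\n{task}")
        answer = llm(prompt)
        answers.append(answer)
        # Keep only concise, transferable snippets, not full transcripts.
        memory = curate(memory, task, answer)
    return answers, memory

# Toy run with stub callables standing in for a real model API.
answers, memo = solve_with_cheatsheet(
    llm=lambda p: "42",
    tasks=["1+1?", "2+2?"],
    curate=lambda m, t, a: m + f"\n- solved '{t}' -> {a}")
```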

Authors:Bo Zhang, Hui Ma, Dailin Li, Jian Ding, Jian Wang, Bo Xu, HongFei Lin
Title: Efficient Tuning of Large Language Models for Knowledge-Grounded Dialogue Generation
Abstract:
Large language models (LLMs) demonstrate remarkable text comprehension and generation capabilities but often lack the ability to utilize up-to-date or domain-specific knowledge not included in their training data. To address this gap, we introduce KEDiT, an efficient method for fine-tuning LLMs for knowledge-grounded dialogue generation. KEDiT operates in two main phases: first, it employs an information bottleneck to compress retrieved knowledge into learnable parameters, retaining essential information while minimizing computational overhead. Second, a lightweight knowledge-aware adapter integrates these compressed knowledge vectors into the LLM during fine-tuning, updating less than 2\% of the model parameters. The experimental results on the Wizard of Wikipedia and a newly constructed PubMed-Dialog dataset demonstrate that KEDiT excels in generating contextually relevant and informative responses, outperforming competitive baselines in automatic, LLM-based, and human evaluations. This approach effectively combines the strengths of pretrained LLMs with the adaptability needed for incorporating dynamic knowledge, presenting a scalable solution for fields such as medicine.
中文: KEDiT通过将检索到的知识压缩为可学习参数,并利用轻量级适配器将其整合到大型语言模型中,以极少的参数更新实现了上下文相关且信息丰富的对话生成。
English: KEDiT efficiently fine-tunes large language models by compressing retrieved knowledge into learnable parameters and integrating them via a lightweight adapter, enabling contextually relevant dialogue generation with minimal parameter updates.

Authors:Xiaowu Zhang, Hongfei Zhao, Jingyi Hou, Zhijie Liu
Title: Unveiling the Impact of Multimodal Features on Chinese Spelling Correction: From Analysis to Design
Abstract:
The Chinese Spelling Correction (CSC) task focuses on detecting and correcting spelling errors in sentences. Current research primarily explores two approaches: traditional multimodal pre-trained models and large language models (LLMs). However, LLMs face limitations in CSC, particularly over-correction, making them suboptimal for this task. While existing studies have investigated the use of phonetic and graphemic information in multimodal CSC models, effectively leveraging these features to enhance correction performance remains a challenge. To address this, we propose the Multimodal Analysis for Character Usage (\textbf{MACU}) experiment, identifying potential improvements for multimodal correction. Based on empirical findings, we introduce \textbf{NamBert}, a novel multimodal model for Chinese spelling correction. Experiments on benchmark datasets demonstrate NamBert's superiority over SOTA methods. We also conduct a comprehensive comparison between NamBert and LLMs, systematically evaluating their strengths and limitations in CSC. Our code and model are available at https://github.com/iioSnail/NamBert.
中文: 本研究提出了NamBert这一新型多模态中文拼写纠错模型,通过有效利用语音和字形特征超越了现有方法,同时系统评估了大语言模型在此任务中的局限性。
English: The study introduces NamBert, a novel multimodal model for Chinese spelling correction that outperforms current methods by effectively leveraging phonetic and graphemic features, while also highlighting limitations of large language models in this task.

Authors:Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao
Title: VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Abstract:
Recently, DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to the visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1.
中文: DeepSeek R1 基于确定性答案的规则奖励强化学习方法被成功扩展到视觉语言模型VLM-R1中,有效提升了视觉推理任务的性能表现和泛化能力。
English: DeepSeek R1's reinforcement learning approach, using rule-based rewards for tasks with clear answers, is successfully extended to vision-language models through VLM-R1, enhancing both performance and generalization in visual reasoning tasks.
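
The appeal of rule-based rewards is that deterministic annotations yield exact, verifiable scores. Purely as an illustration (the precise reward shaping in Perception-R1 or VLM-R1 is not given in these abstracts), a visual grounding task could be rewarded with intersection-over-union against the annotated box plus a small bonus for well-formatted output:

```python
def iou(box_a, box_b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_reward(pred_box, gt_box, well_formatted=True):
    # Deterministic, rule-based reward (illustrative shaping only):
    # localization quality plus a small bonus for parseable output.
    return iou(pred_box, gt_box) + (0.1 if well_formatted else 0.0)

print(grounding_reward((0, 0, 10, 10), (2, 2, 12, 12)))  # ~0.57
```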

Authors:Patrick Fernandes, Sweta Agrawal, Emmanouil Zaranis, André F. T. Martins, Graham Neubig
Title: Do LLMs Understand Your Translations? Evaluating Paragraph-level MT with Question Answering
Abstract:
Despite the steady progress in machine translation evaluation, existing automatic metrics struggle to capture how well meaning is preserved beyond sentence boundaries. We posit that reliance on a single intrinsic quality score, trained to mimic human judgments, might be insufficient for evaluating translations of long, complex passages, and a more ``pragmatic'' approach that assesses how accurately key information is conveyed by a translation in context is needed. We introduce TREQA (Translation Evaluation via Question-Answering), a framework that extrinsically evaluates translation quality by assessing how accurately candidate translations answer reading comprehension questions that target key information in the original source or reference texts. In challenging domains that require long-range understanding, such as literary texts, we show that TREQA is competitive with and, in some cases, outperforms state-of-the-art neural and LLM-based metrics in ranking alternative paragraph-level translations, despite never being explicitly optimized to correlate with human judgments. Furthermore, the generated questions and answers offer interpretability: empirical analysis shows that they effectively target translation errors identified by experts in evaluated datasets. Our code is available at https://github.com/deep-spin/treqa
中文: 现有机器翻译自动评估指标难以衡量跨句子的意义保留,因此我们提出TREQA框架,通过测试翻译文本对原文关键信息的阅读理解问题回答准确性来评估质量,在复杂领域表现优异且提供可解释性。
English: Current automatic metrics for machine translation evaluation often fail to assess meaning preservation beyond individual sentences, prompting the introduction of TREQA, a pragmatic framework that evaluates translations by testing how well they answer comprehension questions about key information in the source text, showing competitive performance and enhanced interpretability in complex domains.
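
A skeletal version of the extrinsic QA loop might look as follows, where `llm(prompt) -> str` is a hypothetical completion callable; the real TREQA pipeline also leverages reference texts and a more careful answer-matching step than the yes/no check used here.

```python
def treqa_score(llm, source, candidate, num_questions=5):
    """QA-based extrinsic evaluation (sketch). `llm` is a hypothetical
    prompt-to-text callable; answers are compared with a crude yes/no check."""
    questions = llm(
        f"Write {num_questions} reading-comprehension questions about the "
        f"key information in this text, one per line:\n{source}"
    ).splitlines()
    correct = 0
    for q in questions:
        gold = llm(f"Answer using only the source text.\n{source}\nQ: {q}")
        pred = llm(f"Answer using only the translation.\n{candidate}\nQ: {q}")
        same = llm("Do these two answers convey the same information? "
                   f"Reply yes or no.\nA1: {gold}\nA2: {pred}")
        correct += same.strip().lower().startswith("yes")
    return correct / max(len(questions), 1)
```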

Authors:Juzheng Zhang, Jiacheng You, Ashwinee Panda, Tom Goldstein
Title: LoRI: Reducing Cross-Task Interference in Multi-Task Low-Rank Adaptation
Abstract:
Low-Rank Adaptation (LoRA) has emerged as a popular parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), yet it still incurs notable overhead and suffers from parameter interference in multi-task scenarios. We propose LoRA with Reduced Interference (LoRI), a simple yet effective approach that freezes the projection matrices $A$ as random projections and sparsifies the matrices $B$ using task-specific masks. This design substantially reduces the number of trainable parameters while maintaining strong task performance. Moreover, LoRI minimizes cross-task interference in adapter merging by leveraging the orthogonality between adapter subspaces, and supports continual learning by using sparsity to mitigate catastrophic forgetting. Extensive experiments across natural language understanding, mathematical reasoning, code generation, and safety alignment tasks demonstrate that LoRI outperforms full fine-tuning and existing PEFT methods, while using up to 95% fewer trainable parameters than LoRA. In multi-task experiments, LoRI enables effective adapter merging and continual learning with reduced cross-task interference. Code is available at: https://github.com/juzhengz/LoRI
中文: LoRI是一种改进的微调方法,通过冻结随机投影矩阵并应用任务特定稀疏化,在保持优异性能的同时大幅减少可训练参数,相比LoRA减少高达95%参数且有效降低跨任务干扰。
English: LoRI is an enhanced fine-tuning method that freezes random projection matrices and applies task-specific sparsity to significantly reduce trainable parameters while outperforming full fine-tuning and existing PEFT methods, with up to 95% fewer parameters than LoRA and reduced cross-task interference.
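
Reading the abstract literally, a LoRI-style adapter can be sketched in a few lines of PyTorch: the projection A is frozen as a random matrix and only the masked entries of B receive gradients. The mask here is random for illustration; in the paper it is task-specific.

```python
import torch
import torch.nn as nn

class LoRILinear(nn.Module):
    """LoRI-style adapter (sketch from the abstract): A is a frozen random
    projection; B is trainable but gated by a fixed task-specific mask."""

    def __init__(self, base, rank=8, density=0.1):
        super().__init__()
        for p in base.parameters():
            p.requires_grad_(False)  # backbone stays frozen, as in PEFT
        self.base = base
        out_f, in_f = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(rank, in_f) / rank ** 0.5,
                              requires_grad=False)  # frozen random projection
        self.B = nn.Parameter(torch.zeros(out_f, rank))
        # Random mask for illustration; the paper selects it per task.
        self.register_buffer("mask", (torch.rand(out_f, rank) < density).float())

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ (self.B * self.mask).T

layer = LoRILinear(nn.Linear(64, 64))
layer(torch.randn(2, 64)).sum().backward()  # only unmasked entries of B train
```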

Authors:Yixin Cao, Jiahao Ying, Yaoning Wang, Xipeng Qiu, Xuanjing Huang, Yugang Jiang
Title: Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric
Abstract:
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications, yet current evaluation methods struggle to keep pace with their rapid development. One core challenge of evaluation in the LLM era is the generalization issue: how to infer a model's near-unbounded abilities from inevitably bounded benchmarks. We address this challenge by proposing the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores. MUI quantifies the effort a model expends on a task, defined as the proportion of activated neurons or features during inference. Intuitively, a truly capable model should achieve higher performance with lower effort. Extensive experiments across popular LLMs reveal a consistent inverse logarithmic relationship between MUI and performance, which we formulate as the Utility Law. From this law we derive four practical corollaries that (i) guide training diagnostics, (ii) expose data contamination issues, (iii) enable fairer model comparisons, and (iv) design model-specific dataset diversity. Our code can be found at https://github.com/ALEX-nlp/MUI-Eva.
中文: 本文提出模型利用指数(MUI),通过量化推理过程中激活神经元比例来评估大语言模型效率,发现其与性能呈反比对数关系,并为模型评估与优化提供了四项实用推论。
English: This paper introduces the Model Utilization Index (MUI), an interpretable metric that measures the proportion of activated neurons during inference to assess LLMs' efficiency, revealing an inverse logarithmic relationship with performance and offering practical applications for model evaluation and improvement.
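
To make the 'effort' definition concrete, here is a toy measurement in the spirit of MUI, assuming effort is the fraction of post-ReLU activations above a threshold in one forward pass. The paper instead counts activated neurons or features inside LLMs, so treat this hook-based version as an interpretation of the abstract, not the official metric.

```python
import torch
import torch.nn as nn

def utilization(model, x, threshold=0.0):
    """Fraction of post-ReLU activations above `threshold` in one forward
    pass -- an MUI-flavored 'effort' measure (interpretation only)."""
    active, total = 0, 0
    def hook(module, inputs, output):
        nonlocal active, total
        active += (output > threshold).sum().item()
        total += output.numel()
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.ReLU)]
    with torch.no_grad():
        model(x)
    for h in handles:
        h.remove()
    return active / max(total, 1)

toy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10), nn.ReLU())
print(utilization(toy, torch.randn(4, 32)))  # lower effort at equal accuracy
```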

Authors:Qi Liu, Haozhe Duan, Yiqun Chen, Quanfeng Lu, Weiwei Sun, Jiaxin Mao
Title: LLM4Ranking: An Easy-to-use Framework of Utilizing Large Language Models for Document Reranking
Abstract:
Utilizing large language models (LLMs) for document reranking has been a popular and promising research direction in recent years, with many studies dedicated to improving the performance and efficiency of using LLMs for reranking. Besides, it can also be applied in many real-world applications, such as search engines or retrieval-augmented generation. In response to the growing demand for research and application in practice, we introduce a unified framework, \textbf{LLM4Ranking}, which enables users to adopt different ranking methods using open-source or closed-source API-based LLMs. Our framework provides a simple and extensible interface for document reranking with LLMs, as well as easy-to-use evaluation and fine-tuning scripts for this task. We conducted experiments based on this framework and evaluated various models and methods on several widely used datasets, providing reproducibility results on utilizing LLMs for document reranking. Our code is publicly available at https://github.com/liuqi6777/llm4ranking.
Chinese: LLM4Ranking框架为使用大语言模型进行文档重排提供了一个统一且可扩展的接口,使用户能够通过公开代码评估和微调多种模型与方法。
English: The LLM4Ranking framework provides a unified and extensible interface for document reranking using large language models, enabling users to evaluate and fine-tune various models and methods with publicly available code.

Authors:Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal
Title: Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression
Abstract:
Post-training quantization (PTQ) reduces a model's memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits -- which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2 and 3-bit regime. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct's unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ's ability to identify important weights is not limited to task-conditioned settings.
Chinese: TaCQ是一种新颖的混合精度训练后量化方法,通过保留任务关键权重为16位精度同时量化其余权重,在低比特设置下以最小内存开销实现卓越的性能恢复。
English: TaCQ is a novel mixed-precision post-training quantization method that preserves task-critical weights at 16-bit precision while quantizing others, achieving superior performance recovery in low-bit settings with minimal memory overhead.
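
The weight-selection rule follows directly from the abstract: contrast the weights with a uniformly quantized copy, scale the difference by gradient information to predict task impact, and keep the highest-impact entries at 16-bit. The snippet below is such a sketch, with placeholder gradients and a simplified symmetric quantizer rather than the authors' implementation.

```python
import torch

def uniform_quantize(w, bits=3):
    # Simplified symmetric uniform quantizer (sketch).
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

def tacq_like(w, grad, bits=3, keep_frac=0.02):
    """Keep the weights whose quantization change is predicted (via the
    gradient) to hurt the task most at 16-bit; quantize the rest.
    `grad` is a placeholder for gradients from calibration data."""
    w_q = uniform_quantize(w, bits)
    saliency = (grad * (w_q - w)).abs()
    k = max(1, int(keep_frac * w.numel()))
    keep = saliency.flatten().topk(k).indices
    out = w_q.flatten().clone()
    out[keep] = w.flatten()[keep].half().float()  # preserved in 16-bit
    return out.view_as(w)

w = torch.randn(128, 128)
w_mixed = tacq_like(w, torch.randn_like(w))
```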

Authors:Mingxuan Li, Hanchen Li, Chenhao Tan
Title: HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Abstract:
Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either use a zero-shot setting without consulting any human input, which leads to low alignment, or fine-tune LLMs on labeled data, which requires a non-trivial number of samples. Moreover, previous methods often provide little reasoning behind automated evaluations. In this paper, we propose HypoEval, a Hypothesis-guided Evaluation framework, which first uses a small corpus of human evaluations to generate more detailed rubrics for human judgments and then incorporates a checklist-like approach to combine LLM's assigned scores on each decomposed dimension to acquire overall scores. With only 30 human evaluations, HypoEval achieves state-of-the-art performance in alignment with both human rankings (Spearman correlation) and human scores (Pearson correlation), on average outperforming G-Eval by 11.86% and fine-tuned Llama-3.1-8B-Instruct with at least 3 times more human evaluations by 11.95%. Furthermore, we conduct systematic studies to assess the robustness of HypoEval, highlighting its effectiveness as a reliable and interpretable automated evaluation framework.
中文摘要:HypoEval是一种创新框架,通过少量人工评估生成详细评分标准并采用清单式方法整合维度得分,仅用30个样本即可实现与人类评估的最佳对齐,同时提供可解释的自动化评测。
English Summary: HypoEval is a novel framework that enhances LLM-based evaluation by generating detailed rubrics from minimal human input and using a checklist approach to combine dimension scores, achieving superior alignment with human judgments using only 30 samples while providing interpretable reasoning.
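
The checklist-style combination step can be sketched as a weighted sum over per-rubric judge scores. In this hypothetical rendering, `llm` is a prompt-to-float callable; the real HypoEval first derives its rubrics from roughly 30 human evaluations.

```python
def hypoeval_like_score(llm, rubrics, text, weights=None):
    """Checklist-style scoring (sketch): score each rubric dimension with
    an LLM judge, then combine into an overall score. `llm` is a
    hypothetical prompt-to-float callable."""
    weights = weights or [1.0 / len(rubrics)] * len(rubrics)
    dims = [llm(f"Rubric: {r}\nText: {text}\nScore 1-5:") for r in rubrics]
    return sum(w * s for w, s in zip(weights, dims))
```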

Authors:Will LeVine, Bijan Varjavand
Title: Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking
Abstract:
Modern Large Language Model (LLM) systems typically rely on Retrieval Augmented Generation (RAG) which aims to gather context that is useful for response generation. These RAG systems typically optimize strictly towards retrieving context that is maximally relevant to the query. However, conventional theory suggests that retrieval systems which seek to maximize context relevance without any additional explicit criteria can create information bottlenecks. We reaffirm this finding in the modern age of LLMs by showing that in standard RAG pipelines, maximizing for context relevance alone can degrade downstream response quality. In response, we show evaluations of existing RAG methods which account for both context relevance and answer quality. These evaluations introduce a novel finding that existing RAG systems scale poorly with inference time compute usage when considering our combined metric. We introduce "RErank BEyond reLevance (REBEL)", which enables RAG systems to scale with inference-time compute via injection of multi-criteria optimization using Chain-of-Thought prompting (and optionally Multi-Turn dialogue). Ultimately, this enables a new performance/speed tradeoff curve, where RAG systems are able to achieve both higher relevance of retrieved contexts and superior answer quality as inference time increases. Code for the implementation of our method in llama-index can be found at the following PR: https://github.com/run-llama/llama_index/pull/17590. Code for running experiments using this llama-index implementation can be found at https://github.com/microsoft/REBEL.
中文: 现代RAG系统若仅优化上下文相关性会制约回答质量,而REBEL方法通过多标准推理优化,实现了检索相关性和答案质量的双重提升。
English: Modern RAG systems often focus solely on maximizing context relevance, which can limit response quality, but the new REBEL method improves both relevance and answer quality by optimizing multiple criteria during inference.
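
In practice, the multi-criteria injection amounts to a carefully structured reranking prompt. The sketch below shows one plausible shape for such a Chain-of-Thought prompt and a thin wrapper around a hypothetical `llm` callable; the actual REBEL prompt in the llama-index implementation differs.

```python
import json

PROMPT = """You are reranking retrieved passages for the query below.
For each passage, reason step by step about (1) relevance to the query,
(2) credibility of the source, and (3) how much it would improve the
final answer's quality. End your reply with a JSON list of passage ids
on its own line, best first.

Query: {query}

Passages:
{passages}
"""

def rerank(llm, query, passages):
    # `llm(prompt) -> str` is a hypothetical completion callable; the
    # JSON ranking is assumed to be the last line of its reply.
    listing = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages))
    reply = llm(PROMPT.format(query=query, passages=listing))
    order = json.loads(reply.strip().splitlines()[-1])
    return [passages[i] for i in order]
```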

Authors:Yubin Hong, Chaofan Li, Jingyi Zhang, Yingxia Shao
Title: FG-RAG: Enhancing Query-Focused Summarization with Context-Aware Fine-Grained Graph RAG
Abstract:
Retrieval-Augmented Generation (RAG) enables large language models to provide more precise and pertinent responses by incorporating external knowledge. In the Query-Focused Summarization (QFS) task, GraphRAG-based approaches have notably enhanced the comprehensiveness and diversity of generated responses. However, existing GraphRAG-based approaches predominantly focus on coarse-grained information summarization without being aware of the specific query, and the retrieved content lacks sufficient contextual information to generate comprehensive responses. To address the deficiencies of current RAG systems, we propose Context-Aware Fine-Grained Graph RAG (FG-RAG) to enhance the performance of the QFS task. FG-RAG employs Context-Aware Entity Expansion in graph retrieval to expand the coverage of retrieved entities in the graph, thus providing enough contextual information for the retrieved content. Furthermore, FG-RAG utilizes Query-Level Fine-Grained Summarization to incorporate fine-grained details during response generation, enhancing query awareness for the generated summarization. Our evaluation demonstrates that FG-RAG outperforms other RAG systems in multiple metrics of comprehensiveness, diversity, and empowerment when handling the QFS task. Our implementation is available at https://github.com/BuptWululu/FG-RAG.
中文: FG-RAG通过上下文感知的实体扩展和查询级细粒度摘要,显著提升了查询聚焦摘要任务的全面性、多样性和赋能效果,优于现有RAG系统。
English: FG-RAG enhances Query-Focused Summarization by expanding entity coverage and incorporating fine-grained details, outperforming existing RAG systems in comprehensiveness, diversity, and empowerment.

Authors:Yiting Lu, Jiakang Yuan, Zhen Li, Shitian Zhao, Qi Qin, Xinyue Li, Le Zhuo, Licheng Wen, Dongyang Liu, Yuewen Cao, Xiangchao Yan, Xin Li, Tianshuo Peng, Shufei Zhang, Botian Shi, Tao Chen, Zhibo Chen, Lei Bai, Peng Gao, Bo Zhang
Title: OmniCaptioner: One Captioner to Rule Them All
Abstract:
We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks like text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.
中文: OmniCaptioner 是一个通用的视觉描述框架,可为多种视觉领域生成精细的文本描述,增强大语言模型的视觉推理能力、改进图像生成并实现高效的监督微调。
English: OmniCaptioner is a unified visual captioning framework that generates detailed descriptions for diverse visual domains, enhancing visual reasoning with LLMs, improving image generation, and enabling efficient supervised fine-tuning.

Authors:Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee
Title: TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling
Abstract:
Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech tokens for joint modeling remains underexplored. To address this, we introduce Text-Aligned Speech Tokenization and Embedding (TASTE), a method that directly addresses the modality gap by aligning speech tokens with the corresponding text transcription during the tokenization stage. We propose a method that achieves this through an attention-based aggregation mechanism and with speech reconstruction as the training objective. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. With TASTE, we perform straightforward joint spoken language modeling by using Low-Rank Adaptation on the pre-trained text LLM. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze, while significantly outperforming other pre-trained SLMs on speech continuation across subjective and objective evaluations. To our knowledge, TASTE is the first end-to-end approach that utilizes a reconstruction objective to automatically learn a text-aligned speech tokenization and embedding suitable for spoken language modeling. Our demo, code, and model are available at https://mtkresearch.github.io/TASTE-SpokenLM.github.io.
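中文: TASTE提出文本对齐的语音分词与嵌入方法,通过注意力聚合机制和语音重建目标在分词阶段弥合模态差距,在大幅缩短标记序列的同时保留副语言信息,并支持基于LoRA的联合口语语言建模。
English: TASTE introduces text-aligned speech tokenization and embedding that bridges the modality gap at the tokenization stage via attention-based aggregation and a speech reconstruction objective, drastically shortening token sequences while preserving paralinguistic information and enabling joint spoken language modeling with LoRA.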

Authors:Jifang Wang, Xue Yang, Longyue Wang, Zhenran Xu, Yiyu Wang, Yaowei Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang
Title: A Unified Agentic Framework for Evaluating Conditional Image Generation
Abstract:
Conditional image generation has gained significant attention for its ability to personalize content. However, the field faces challenges in developing task-agnostic, reliable, and explainable evaluation metrics. This paper introduces CIGEval, a unified agentic framework for comprehensive evaluation of conditional image generation tasks. CIGEval utilizes large multimodal models (LMMs) as its core, integrating a multi-functional toolbox and establishing a fine-grained evaluation framework. Additionally, we synthesize evaluation trajectories for fine-tuning, empowering smaller LMMs to autonomously select appropriate tools and conduct nuanced analyses based on tool outputs. Experiments across seven prominent conditional image generation tasks demonstrate that CIGEval (GPT-4o version) achieves a high correlation of 0.4625 with human assessments, closely matching the inter-annotator correlation of 0.47. Moreover, when implemented with 7B open-source LMMs using only 2.3K training trajectories, CIGEval surpasses the previous GPT-4o-based state-of-the-art method. Case studies on GPT-4o image generation highlight CIGEval's capability in identifying subtle issues related to subject consistency and adherence to control guidance, indicating its great potential for automating evaluation of image generation tasks with human-level reliability.
中文: 本文提出CIGEval框架,利用大型多模态模型全面评估条件图像生成任务,在多项实验中达到接近人工评估的相关性,并超越现有最优方法。
English: This paper presents CIGEval, an agentic framework using large multimodal models to comprehensively evaluate conditional image generation, achieving near-human correlation in assessments and surpassing previous state-of-the-art methods.

Authors:Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville
Title: Adaptive Computation Pruning for the Forgetting Transformer
Abstract:
The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. In particular, our method performs provably safe pruning via a dynamically set pruning threshold that guarantees the pruned attention weights are negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs and memory accesses in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 50% to 70% reduction in attention runtime (or a 2-3$\times$ speedup) and a roughly 10% to 40% increase in end-to-end training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
中文摘要:针对遗忘变换器提出的自适应计算剪枝(ACP)方法通过基于遗忘门衰减动态剪枝可忽略计算,在保持性能不变的同时实现注意力计算2-3倍加速和10-40%训练吞吐量提升。
English Summary: The Adaptive Computation Pruning (ACP) method for the Forgetting Transformer dynamically prunes negligible computations based on forget gate decay, achieving 2-3× attention speedup and 10-40% training throughput gains without performance loss.
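
The 'provably safe' criterion can be illustrated with a toy mask computation: per-timestep log forget-gate values give a cumulative decay between each query and key, which bounds the attainable attention weight; pairs whose bound falls below a negligibility threshold are skipped. The sketch below assumes a fixed bound `logit_cap` on the un-decayed logits, whereas the paper sets its pruning threshold dynamically.

```python
import torch

def acp_keep_mask(log_forget, logit_cap=10.0, eps=1e-4):
    """log_forget: (T,) per-step log forget-gate values (<= 0).
    D[i, j] = sum_{k=j+1..i} log f_k biases the logit of query i attending
    to key j; a pair is skipped when even the largest un-decayed logit
    (bounded here by `logit_cap`, an assumed constant) cannot lift its
    attention weight above eps."""
    cum = torch.cumsum(log_forget, dim=0)        # prefix sums of log f
    decay = cum.unsqueeze(1) - cum.unsqueeze(0)  # D[i, j] for j <= i
    T = log_forget.shape[0]
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    keep = (decay + logit_cap) > torch.log(torch.tensor(eps))
    return keep & causal                         # True = compute this pair

mask = acp_keep_mask(torch.full((512,), -0.1))
print(mask.float().mean().item())  # share of the T x T grid still computed
```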

Authors:Yuxin Wang, Yiran Guo, Yining Zheng, Zhangyue Yin, Shuo Chen, Jie Yang, Jiajun Chen, Yuan Li, Xuanjing Huang, Xipeng Qiu
Title: FamilyTool: A Multi-hop Personalized Tool Use Benchmark
Abstract:
The integration of tool learning with Large Language Models (LLMs) has expanded their capabilities in handling complex tasks by leveraging external tools. However, existing benchmarks for tool learning inadequately address critical real-world personalized scenarios, particularly those requiring multi-hop reasoning and inductive knowledge adaptation in dynamic environments. To bridge this gap, we introduce FamilyTool, a novel benchmark grounded in a family-based knowledge graph (KG) that simulates personalized, multi-hop tool use scenarios. FamilyTool comprises base and extended datasets that challenge LLMs with queries spanning 1 to 4 and 2 to 6 relational hops, respectively (e.g., inferring familial connections and preferences), and incorporates an inductive KG setting where models must adapt to unseen user preferences and relationships without re-training -- a common limitation of prior approaches that compromises generalization. We further propose KGETool: a simple KG-augmented evaluation pipeline to systematically assess LLMs' tool use ability in these settings. Experiments reveal significant performance gaps in state-of-the-art LLMs, with accuracy dropping sharply as hop complexity increases and inductive scenarios exposing severe generalization deficits. These findings underscore the limitations of current LLMs in handling personalized, evolving real-world contexts and highlight the urgent need for advancements in tool-learning frameworks. FamilyTool serves as a critical resource for evaluating and advancing LLM agents' reasoning, adaptability, and scalability in complex, dynamic environments. Code and dataset are available at \href{https://github.com/yxzwang/FamilyTool}{https://github.com/yxzwang/FamilyTool}.
中文: 本文提出基于家庭知识图谱的新基准FamilyTool,通过个性化多跳推理任务和归纳场景测试大语言模型,揭示了现有模型在处理复杂动态环境时存在的显著性能差距与泛化缺陷。
English: This paper introduces FamilyTool, a novel benchmark based on a family knowledge graph that challenges large language models with personalized, multi-hop reasoning tasks and inductive scenarios, revealing significant performance gaps and generalization issues in current models.

Authors:Li An, Yujian Liu, Yepeng Liu, Yang Zhang, Yuheng Bu, Shiyu Chang
Title: Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning
Abstract:
Watermarking has emerged as a promising technique for detecting texts generated by LLMs. Current research has primarily focused on three design criteria: high quality of the watermarked text, high detectability, and robustness against removal attacks. However, the security against spoofing attacks remains relatively understudied. For example, a piggyback attack can maliciously alter the meaning of watermarked text, transforming it into hate speech, while preserving the original watermark, thereby damaging the reputation of the LLM provider. We identify two core challenges that make defending against spoofing difficult: (1) the need for watermarks to be both sensitive to semantic-distorting changes and insensitive to semantic-preserving edits, and (2) the contradiction between the need to detect global semantic shifts and the local, auto-regressive nature of most watermarking schemes. To address these challenges, we propose a semantic-aware watermarking algorithm that post-hoc embeds watermarks into a given target text while preserving its original meaning. Our method introduces a semantic mapping model, which guides the generation of a green-red token list, contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving changes. Experiments on two standard benchmarks demonstrate strong robustness against removal attacks and security against spoofing attacks, including sentiment reversal and toxic content insertion, while maintaining high watermark detectability. Our approach offers a significant step toward more secure and semantically aware watermarking for LLMs. Our code is available at https://github.com/UCSB-NLP-Chang/contrastive-watermark.
中文: 针对LLM水印技术面临的安全挑战,特别是欺骗攻击可在保留水印的同时恶意篡改文本含义,本研究提出一种语义感知水印算法,通过后处理嵌入水印,在保持高检测率的同时有效抵御移除攻击和欺骗攻击。
English: Watermarking for LLMs faces security challenges from spoofing attacks, which can maliciously alter text meaning while preserving watermarks, and this study proposes a semantic-aware algorithm that embeds watermarks post-hoc to ensure robustness against removal and security against spoofing, maintaining high detectability.

Authors:Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael P. Brenner, Peter Norgaard
Title: FEABench: Evaluating Language Models on Multiphysics Reasoning Ability
Abstract:
Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language models (LLMs) and LLM agents to simulate and solve physics, mathematics and engineering problems using finite element analysis (FEA). We introduce a comprehensive evaluation scheme to investigate the ability of LLMs to solve these problems end-to-end by reasoning over natural language problem descriptions and operating COMSOL Multiphysics$^\circledR$, an FEA software, to compute the answers. We additionally design a language model agent equipped with the ability to interact with the software through its Application Programming Interface (API), examine its outputs and use tools to improve its solutions over multiple iterations. Our best performing strategy generates executable API calls 88% of the time. LLMs that can successfully interact with and operate FEA software to solve problems such as those in our benchmark would push the frontiers of automation in engineering. Acquiring this capability would augment LLMs' reasoning skills with the precision of numerical solvers and advance the development of autonomous systems that can tackle complex problems in the real world. The code is available at https://github.com/google/feabench
中文: FEABench是一个基准测试,旨在评估大型语言模型及其代理通过有限元分析模拟和解决物理、数学及工程问题的能力,以推动人工智能与数值求解器结合,提升工程领域的自动化水平。
English: FEABench is a benchmark designed to assess how well large language models and their agents can simulate and solve physics, math, and engineering problems using finite element analysis, with the goal of enhancing automation in these fields by integrating AI with numerical solvers.

Authors:Krithi Shailya, Shreya Rajpal, Gokul S Krishnan, Balaraman Ravindran
Title: LExT: Towards Evaluating Trustworthiness of Natural Language Explanations
Abstract:
As Large Language Models (LLMs) become increasingly integrated into high-stakes domains, there have been several approaches proposed toward generating natural language explanations. These explanations are crucial for enhancing the interpretability of a model, especially in sensitive domains like healthcare, where transparency and reliability are key. In light of such explanations being generated by LLMs and their known concerns, there is a growing need for robust evaluation frameworks to assess model-generated explanations. Natural Language Generation metrics like BLEU and ROUGE capture syntactic and semantic accuracies but overlook other crucial aspects such as factual accuracy, consistency, and faithfulness. To address this gap, we propose a general framework for quantifying trustworthiness of natural language explanations, balancing Plausibility and Faithfulness, to derive a comprehensive Language Explanation Trustworthiness Score (LExT) (The code and setup to reproduce our experiments are publicly available at https://github.com/cerai-iitm/LExT). Applying our domain-agnostic framework to the healthcare domain using public medical datasets, we evaluate six models, including domain-specific and general-purpose models. Our findings demonstrate significant differences in their ability to generate trustworthy explanations. On comparing these explanations, we make interesting observations such as inconsistencies in Faithfulness demonstrated by general-purpose models and their tendency to outperform domain-specific fine-tuned models. This work further highlights the importance of using a tailored evaluation framework to assess natural language explanations in sensitive fields, providing a foundation for improving the trustworthiness and transparency of language models in healthcare and beyond.
中文: 本文提出了一个领域无关的LExT框架,通过平衡合理性与忠实性来评估自然语言解释的可信度,弥补了现有指标的不足,并在医疗领域应用中揭示了不同模型生成可信解释能力的显著差异。
English: This paper introduces a domain-agnostic framework called LExT to evaluate the trustworthiness of natural language explanations by balancing plausibility and faithfulness, addressing limitations of existing metrics and demonstrating its application in healthcare with significant model performance variations.

Authors:Dongyang Fan, Vinko Sabolčec, Matin Ansaripour, Ayush Kumar Tarun, Martin Jaggi, Antoine Bosselut, Imanol Schlag
Title: Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs
Abstract:
The increasing adoption of web crawling opt-outs by copyright holders of online content raises critical questions about the impact of data compliance on large language model (LLM) performance. However, little is known about how these restrictions (and the resultant filtering of pretraining datasets) affect the capabilities of models trained using these corpora. In this work, we conceptualize this effect as the $\textit{data compliance gap}$ (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs, and those that do not. We measure the data compliance gap in two settings: pretraining models from scratch and continual pretraining from existing compliant models (simulating a setting where copyrighted data could be integrated later in pretraining). Our experiments with 1.5B models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0\% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training. Our study provides empirical insights into the long-debated trade-off between data compliance and downstream model performance, informing future discussions on AI training practices and policy decisions. Our website is available at https://data-compliance.github.io/.
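中文: 本研究提出数据合规差距(DCG)来量化遵守网络爬取退出协议对大语言模型性能的影响,实验表明合规训练几乎不影响通用知识获取,但排除主要出版商的数据会降低生物医学等专业领域的表现。
English: This work introduces the data compliance gap (DCG) to quantify how honoring web crawling opt-outs affects LLM performance, finding near-zero impact on general knowledge acquisition but measurable declines in specialized domains such as biomedical research when major publishers are excluded.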

Authors:Chejian Xu, Wei Ping, Peng Xu, Zihan Liu, Boxin Wang, Mohammad Shoeybi, Bo Li, Bryan Catanzaro
Title: From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models
Abstract:
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling, all of which require models to process and reason over long sequences of text and multimodal data. In this work, we introduce an efficient training recipe for building ultra-long context LLMs from an aligned instruct model, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach leverages efficient continued pretraining strategies to extend the context window and employs effective instruction tuning to maintain the instruction-following and reasoning abilities. Our UltraLong-8B, built on Llama3.1-Instruct with our recipe, achieves state-of-the-art performance across a diverse set of long-context benchmarks. Importantly, models trained with our approach maintain competitive performance on standard benchmarks, demonstrating balanced improvements for both long and short context tasks. We further provide an in-depth analysis of key design choices, highlighting the impacts of scaling strategies and data composition. Our findings establish a robust framework for efficiently scaling context lengths while preserving general model capabilities. We release all model weights at: https://ultralong.github.io/.
中文: 本研究提出一种高效训练方法,可将大语言模型的上下文长度从128K扩展到4M词元,在保持指令遵循和推理能力的同时,在长上下文基准测试中达到最优性能,并在标准任务中保持竞争力。
English: This work introduces an efficient training method to extend LLMs' context length from 128K to 4M tokens while maintaining instruction-following and reasoning capabilities, achieving state-of-the-art performance on long-context benchmarks and competitive results on standard tasks.

Authors:Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis
Title: Multi-Sense Embeddings for Language Models and Knowledge Distillation
Abstract:
Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a limited number of senses (or meanings). We propose multi-sense embeddings as a drop-in replacement for each token in order to capture the range of their uses in a language. To construct a sense embedding dictionary, we apply a clustering algorithm to embeddings generated by an LLM and consider the cluster centers as representative sense embeddings. In addition, we propose a novel knowledge distillation method that leverages the sense dictionary to learn a smaller student model that mimics the senses from the much larger base LLM model, offering significant space and inference time savings, while maintaining competitive performance. Via thorough experiments on various benchmarks, we showcase the effectiveness of our sense embeddings and knowledge distillation approach. We share our code at https://github.com/Qitong-Wang/SenseDict
中文摘要:本文提出多义嵌入替代标准词嵌入以更好捕捉词汇语义,并通过基于义项词典的知识蒸馏方法训练出更小更高效的学生模型,在保持性能的同时显著节省空间和推理时间。
English Summary: This paper introduces multi-sense embeddings as a replacement for standard token embeddings to better capture word meanings, and proposes a knowledge distillation method using a sense dictionary to create smaller, efficient student models while maintaining performance.
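
The dictionary-construction step maps cleanly onto off-the-shelf clustering: gather the contextual embeddings an LLM produces for one token across many contexts, cluster them, and keep the centers as that token's sense embeddings. A minimal scikit-learn version, with random vectors standing in for real LLM embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

def sense_dictionary(token_embeddings, num_senses=4):
    # Cluster one token's contextual embeddings; the cluster centers
    # serve as its sense embeddings.
    km = KMeans(n_clusters=num_senses, n_init=10, random_state=0)
    km.fit(token_embeddings)
    return km.cluster_centers_

def nearest_sense(senses, context_embedding):
    # Snap a contextual embedding to its closest sense vector.
    dists = np.linalg.norm(senses - context_embedding, axis=1)
    return senses[dists.argmin()]

embs = np.random.randn(200, 16)   # stand-in for real LLM embeddings
senses = sense_dictionary(embs)
print(nearest_sense(senses, embs[0]).shape)  # (16,)
```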

Authors:Peerat Limkonchotiwat, Kanruethai Masuk, Surapon Nonesung, Chalermpun Mai-On, Sarana Nutanong, Wuttikorn Ponwitayarat, Potsawee Manakul
Title: Assessing Thai Dialect Performance in LLMs with Automatic Benchmarks and Human Evaluation
Abstract:
Large language models show promising results in various NLP tasks. Despite these successes, the robustness and consistency of LLMs in underrepresented languages remain largely unexplored, especially concerning local dialects. Existing benchmarks also focus on main dialects, neglecting LLMs' ability to handle local dialect texts. In this paper, we introduce a Thai local dialect benchmark covering Northern (Lanna), Northeastern (Isan), and Southern (Dambro) Thai, evaluating LLMs on five NLP tasks: summarization, question answering, translation, conversation, and food-related tasks. Furthermore, we propose a human evaluation guideline and metric for Thai local dialects to assess generation fluency and dialect-specific accuracy. Results show that LLM performance declines significantly in local Thai dialects compared to standard Thai, with only proprietary models like GPT-4o and Gemini2 demonstrating some fluency.
中文: 大型语言模型在泰语方言中的表现远不及标准泰语,仅有GPT-4o和Gemini2等专有模型展现出一定流畅性,这凸显了对小众语言进行更全面评估和改进的必要性。
English: Large language models exhibit significant performance drops in local Thai dialects compared to standard Thai, with only proprietary models like GPT-4o and Gemini2 showing some fluency, highlighting the need for better evaluation and improvement in underrepresented languages.

Authors:Yiming Tang, Yi Fan, Chenxiao Yu, Tiankai Yang, Yue Zhao, Xiyang Hu
Title: StealthRank: LLM Ranking Manipulation via Stealthy Prompt Optimization
Abstract:
The integration of large language models (LLMs) into information retrieval systems introduces new attack surfaces, particularly for adversarial ranking manipulations. We present $\textbf{StealthRank}$, a novel adversarial attack method that manipulates LLM-driven ranking systems while maintaining textual fluency and stealth. Unlike existing methods that often introduce detectable anomalies, StealthRank employs an energy-based optimization framework combined with Langevin dynamics to generate StealthRank Prompts (SRPs): adversarial text sequences embedded within item or document descriptions that subtly yet effectively influence LLM ranking mechanisms. We evaluate StealthRank across multiple LLMs, demonstrating its ability to covertly boost the ranking of target items while avoiding explicit manipulation traces. Our results show that StealthRank consistently outperforms state-of-the-art adversarial ranking baselines in both effectiveness and stealth, highlighting critical vulnerabilities in LLM-driven ranking systems. Our code is publicly available at $\href{https://github.com/Tangyiming205069/controllable-seo}{here}$.
中文: 本文提出StealthRank攻击方法,通过基于能量的优化框架和朗之万动力学生成隐蔽提示,能在保持文本流畅性的同时有效操纵大语言模型驱动的排序系统,暗中提升目标条目排名且避免被检测到。
English: This paper introduces StealthRank, an adversarial attack method that manipulates LLM-driven ranking systems through energy-based optimization and Langevin dynamics to generate stealthy prompts, effectively boosting target items' rankings while maintaining textual fluency and evading detection.

Authors:Haoyu Wang, Yujia Fu, Zhu Zhang, Shuo Wang, Zirui Ren, Xiaorong Wang, Zhili Li, Chaoqun He, Bo An, Zhiyuan Liu, Maosong Sun
Title: LLM$\times$MapReduce-V2: Entropy-Driven Convolutional Test-Time Scaling for Generating Long-Form Articles from Extremely Long Resources
Abstract:
Long-form generation is crucial for a wide range of practical applications, typically categorized into short-to-long and long-to-long generation. While short-to-long generation has received considerable attention, generating long texts from extremely long resources remains relatively underexplored. The primary challenge in long-to-long generation lies in effectively integrating and analyzing relevant information from extensive inputs, which remains difficult for current large language models (LLMs). In this paper, we propose LLM$\times$MapReduce-V2, a novel test-time scaling strategy designed to enhance the ability of LLMs to process extremely long inputs. Drawing inspiration from convolutional neural networks, which iteratively integrate local features into higher-level global representations, LLM$\times$MapReduce-V2 utilizes stacked convolutional scaling layers to progressively expand the understanding of input materials. Both quantitative and qualitative experimental results demonstrate that our approach substantially enhances the ability of LLMs to process long inputs and generate coherent, informative long-form articles, outperforming several representative baselines. Both LLM$\times$MapReduce-V2 and SurveyEval are publicly available at https://github.com/thunlp/LLMxMapReduce.
中文摘要:本文提出LLM×MapReduce-V2这一新型测试时扩展策略,通过堆叠卷积层逐步扩展对输入材料的理解,显著增强大语言模型处理超长输入并生成连贯长文本的能力。
English Summary: This paper introduces LLM×MapReduce-V2, a novel test-time scaling strategy that enhances large language models' capacity to process extremely long inputs and generate coherent long-form articles by progressively expanding understanding through stacked convolutional layers.
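A minimal sketch of the convolution-like aggregation described above: each "layer" slides a window over the current digests and merges every window into a higher-level digest until one remains. The `summarize` callable is a hypothetical LLM call; the paper's actual layer design, window handling, and entropy-driven control are richer than this.

```python
def convolutional_aggregate(chunks, summarize, window=3, stride=2):
    """Stacked 'convolutional' scaling layers over text chunks.

    chunks: list of strings (local digests of the source material).
    summarize: hypothetical LLM call mapping a string to a shorter digest.
    """
    digests = list(chunks)
    while len(digests) > 1:
        merged = []
        for i in range(0, len(digests), stride):
            window_text = "\n\n".join(digests[i:i + window])
            merged.append(summarize(window_text))  # merge local into global
        digests = merged  # each pass integrates a wider span of the input
    return digests[0]
```

With stride 2 and window 3, each pass roughly halves the number of digests while letting neighboring windows overlap, so local information is repeatedly integrated into progressively more global representations.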

Authors:Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram
Title: DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding
Abstract:
Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed by parallel verification using the target LLM. This approach leads to faster inference compared to auto-regressive decoding. While there are multiple approaches to create a draft model, one promising approach is to use early-exit methods. These methods draft candidate tokens by using a subset of layers of the primary model and applying the remaining layers for verification, allowing a single model to handle both drafting and verification. While this technique reduces memory usage and computational cost, its performance relies on the choice of the exit layer for drafting and the number of tokens drafted (speculation length) in each SD round. Prior works use hyperparameter exploration to statically select these values. However, our evaluations show that these hyperparameter values are task-specific, and even within a task they are dependent on the current sequence context. We introduce DEL (Dynamic Exit Layer), a plug-and-play method that adaptively selects the exit layer and speculation length during inference. DEL dynamically tracks the acceptance rate of tokens drafted at each layer of an LLM and uses that knowledge to heuristically select the optimal exit layer and speculation length. Our experiments across a broad range of models and downstream tasks show that DEL achieves overall speedups of $2.16\times$$\sim$$2.62\times$ over vanilla auto-regressive decoding and improves upon state-of-the-art SD methods, which peak at $2.43\times$, by up to $0.19\times$. The code is available at https://github.com/hoenza/DEL.
DEL是一种即插即用方法,通过动态跟踪各层草稿令牌的接受率,在推理时自适应地选择提前退出层与推测长度,相比原始自回归解码实现2.16至2.62倍的整体加速,并优于现有最先进的推测解码方法。
DEL is a plug-and-play method that adaptively selects the exit layer and speculation length during inference by tracking per-layer draft-token acceptance rates, achieving overall speedups of 2.16x-2.62x over vanilla auto-regressive decoding and improving upon state-of-the-art speculative decoding methods.
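The following sketch illustrates the kind of acceptance-rate bookkeeping DEL's idea implies: given running per-layer acceptance estimates, pick the exit layer whose expected tokens-per-compute is best. The cost model and the truncated-geometric expectation are standard speculative-decoding approximations, not the paper's exact selection rule.

```python
def choose_exit_layer(acceptance, total_layers, spec_len=4):
    """Pick the drafting exit layer with the best expected throughput.

    acceptance: dict mapping candidate exit layer -> running estimate of
                the probability a token drafted at that layer is accepted.
    """
    def expected_throughput(layer, a):
        # Drafting one token at layer L costs about L/total_layers of a
        # full forward pass; verification costs one full parallel pass.
        draft_cost = spec_len * layer / total_layers
        verify_cost = 1.0
        # Expected accepted tokens for a length-k draft with acceptance
        # rate a: truncated geometric series (plus the bonus token).
        expected_tokens = (spec_len + 1 if a >= 1.0
                           else (1 - a ** (spec_len + 1)) / (1 - a))
        return expected_tokens / (draft_cost + verify_cost)

    return max(acceptance, key=lambda L: expected_throughput(L, acceptance[L]))
```

DEL additionally adapts the speculation length itself; the same throughput estimate could be maximized jointly over `(layer, spec_len)` pairs.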

Authors:P Team, Siwei Wu, Jincheng Ren, Xinrun Du, Shuyue Guo, Xingwei Qu, Yiming Liang, Jie Liu, Yunwen Li, Tianyu Zheng, Boyu Feng, Huaqing Yuan, Zenith Wang, Jiaheng Liu, Wenhao Huang, Chenglin Cai, Haoran Que, Jian Yang, Yuelin Bai, Zekun Moore Wang, Zhouliang Yu, Qunshu Lin, Ding Pan, Yuchen Jiang, Tiannan Wang, Wangchunshu Zhou, Shenzhi Wang, Xingyuan Bu, Minghao Liu, Guoyin Wang, Ge Zhang, Chenghua Lin
Title: COIG-P: A High-Quality and Large-Scale Chinese Preference Dataset for Alignment with Human Values
Abstract:
Aligning large language models (LLMs) with human preferences has achieved remarkable success. However, existing Chinese preference datasets are limited by small scale, narrow domain coverage, and lack of rigorous data validation. Additionally, the reliance on human annotators for instruction and response labeling significantly constrains the scalability of human preference datasets. To address these challenges, we design an LLM-based Chinese preference dataset annotation pipeline with no human intervention. Specifically, we crawled and carefully filtered 92k high-quality Chinese queries and employed 15 mainstream LLMs to generate and score chosen-rejected response pairs. Based on it, we introduce COIG-P (Chinese Open Instruction Generalist - Preference), a high-quality, large-scale Chinese preference dataset comprising 1,009k Chinese preference pairs spanning 6 diverse domains: Chat, Code, Math, Logic, Novel, and Role. Building upon COIG-P, to reduce the overhead of using LLMs for scoring, we trained an 8B-sized Chinese Reward Model (CRM) and meticulously constructed a Chinese Reward Benchmark (CRBench). Evaluation results based on AlignBench \citep{liu2024alignbenchbenchmarkingchinesealignment} show that COIG-P significantly outperforms other Chinese preference datasets, and it brings significant performance improvements ranging from 2% to 12% for the Qwen2/2.5 and Infinity-Instruct-3M-0625 model series, respectively. The results on CRBench demonstrate that our CRM has a strong and robust scoring ability. We apply it to filter chosen-rejected response pairs in a test split of COIG-P, and our experiments show that it is comparable to GPT-4o in identifying low-quality samples while maintaining efficiency and cost-effectiveness. Our codes and data are released at https://github.com/multimodal-art-projection/COIG-P.
Chinese Summary: 本研究通过全自动LLM标注流程构建了大规模中文偏好数据集COIG-P,有效解决了现有数据集规模小、领域窄的问题,实验证明其能显著提升模型性能并训练出高效的中文奖励模型。
English Summary: The study introduces COIG-P, a large-scale Chinese preference dataset created using an automated LLM-based pipeline to overcome limitations of existing datasets, and demonstrates its effectiveness through improved model performance and a robust reward model.

Authors:Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, Megh Thakkar, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty
Title: ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering
Abstract:
Charts are ubiquitous, as people often use them to analyze data, answer questions, and discover critical insights. However, performing complex analytical tasks with charts requires significant perceptual and cognitive effort. Chart Question Answering (CQA) systems automate this process by enabling models to interpret and reason with visual representations of data. However, existing benchmarks like ChartQA lack real-world diversity and have recently shown performance saturation with modern large vision-language models (LVLMs). To address these limitations, we introduce ChartQAPro, a new benchmark that includes 1,341 charts from 157 diverse sources, spanning various chart types, including infographics and dashboards, and featuring 1,948 questions in various types, such as multiple-choice, conversational, hypothetical, and unanswerable questions, to better reflect real-world challenges. Our evaluations with 21 models show a substantial performance drop for LVLMs on ChartQAPro; e.g., Claude Sonnet 3.5 scores 90.5% on ChartQA but only 55.81% on ChartQAPro, underscoring the complexity of chart reasoning. We complement our findings with detailed error analyses and ablation studies, identifying key challenges and opportunities for advancing LVLMs in chart understanding and reasoning. We release ChartQAPro at https://github.com/vis-nlp/ChartQAPro.
中文: 针对现有图表问答基准的不足,ChartQAPro引入了包含1341个多样化图表和1948个复杂问题的新数据集,揭示了大型视觉语言模型性能显著下降,并指出了提升图表推理能力的关键挑战。
English: To address the limitations of existing chart question answering benchmarks, ChartQAPro introduces a diverse dataset of 1,341 charts and 1,948 complex questions, revealing significant performance drops in large vision-language models and highlighting key challenges for advancing chart reasoning capabilities.

Authors:Yang Yan, Yu Lu, Renjun Xu, Zhenzhong Lan
Title: Do Large Language Models Truly Grasp Addition? A Rule-Focused Diagnostic Using Two-Integer Arithmetic
Abstract:
Large language models (LLMs) achieve impressive results on advanced mathematics benchmarks but sometimes fail on basic arithmetic tasks, raising the question of whether they have truly grasped fundamental arithmetic rules or are merely relying on pattern matching. To unravel this issue, we systematically probe LLMs' understanding of two-integer addition ($0$ to $2^{64}$) by testing three crucial properties: commutativity ($A+B=B+A$), representation invariance via symbolic remapping (e.g., $7 \mapsto Y$), and consistent accuracy scaling with operand length. Our evaluation of 12 leading LLMs reveals a stark disconnect: while models achieve high numeric accuracy (73.8-99.8%), they systematically fail these diagnostics. Specifically, accuracy plummets to $\le 7.5$% with symbolic inputs, commutativity is violated in up to 20% of cases, and accuracy scaling is non-monotonic. Interventions further expose this pattern-matching reliance: explicitly providing rules degrades performance by 29.49%, while prompting for explanations before answering merely maintains baseline accuracy. These findings demonstrate that current LLMs address elementary addition via pattern matching, not robust rule induction, motivating new diagnostic benchmarks and innovations in model architecture and training to cultivate genuine mathematical reasoning. Our dataset and generating code are available at https://github.com/kuri-leo/llm-arithmetic-diagnostic.
中文: 大语言模型在基础算术上数值准确率高,但在交换律和符号不变性等基本属性诊断中表现不佳,表明其依赖模式匹配而非真正的规则理解。
English: Large language models achieve high numeric accuracy in basic arithmetic but fail diagnostic tests for fundamental properties like commutativity and symbolic invariance, revealing reliance on pattern matching rather than genuine rule understanding.
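The commutativity and symbolic-remapping diagnostics are easy to reproduce in outline. In this sketch `ask_model` is a hypothetical callable that sends a prompt to an LLM and parses an integer answer; the paper's prompts and answer parsing are more careful.

```python
import random

def commutativity_probe(ask_model, n_trials=100, max_bits=64):
    """Fraction of trials where the model answers A+B and B+A differently."""
    violations = 0
    for _ in range(n_trials):
        a = random.randrange(2 ** max_bits)
        b = random.randrange(2 ** max_bits)
        if ask_model(f"{a}+{b}=") != ask_model(f"{b}+{a}="):
            violations += 1
    return violations / n_trials

def symbolic_remap(n, mapping):
    """Rewrite an integer under a digit-to-symbol mapping, e.g. {7: 'Y'},
    to test representation invariance: a rule-following adder should be
    unaffected, while a pattern matcher typically collapses."""
    return "".join(str(mapping.get(int(d), d)) for d in str(n))
```

A genuine rule-based adder would score identically under `symbolic_remap`; the paper reports accuracy instead dropping to 7.5% or below.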

Authors:Geyang Guo, Tarek Naous, Hiromi Wakaki, Yukiko Nishimura, Yuki Mitsufuji, Alan Ritter, Wei Xu
Title: CARE: Multilingual Human Preference Learning for Cultural Awareness
Abstract:
Language Models (LMs) are typically tuned with human preferences to produce helpful responses, but the impact of preference tuning on the ability to handle culturally diverse queries remains understudied. In this paper, we systematically analyze how native human cultural preferences can be incorporated into the preference learning process to train more culturally aware LMs. We introduce \textbf{CARE}, a multilingual resource containing 3,490 culturally specific questions and 31.7k responses with human judgments. We demonstrate how a modest amount of high-quality native preferences improves cultural awareness across various LMs, outperforming larger generic preference data. Our analyses reveal that models with stronger initial cultural performance benefit more from alignment, leading to gaps among models developed in different regions with varying access to culturally relevant data. CARE is publicly available at https://github.com/Guochry/CARE.
中文摘要:通过引入本土文化偏好优化语言模型的偏好调整,能比使用更大规模通用数据更有效地提升其文化感知能力,CARE资源验证了这一点。
English Summary: Preference tuning for language models is enhanced by incorporating native cultural preferences, improving their cultural awareness more effectively than using larger generic datasets, as demonstrated by the CARE resource.

Authors:Liu Xiao, Li Zhiyuan, Lin Yueyu
Title: State Tuning: State-based Test-Time Scaling on RWKV-7
Abstract:
Test-time scaling has emerged as a prominent research direction in machine learning, enabling models to enhance their expressive capabilities during inference. Transformers, renowned for striking a delicate balance between efficiency and expressiveness, have benefited from test-time scaling techniques that leverage an expanding key-value (KV) cache to significantly improve performance. In this paper, we introduce a novel state-based approach to test-time scaling, which we term state tuning, tailored to the RNN-based RWKV-7 model. By exploiting the unique strengths of RWKV-7, our method achieves state-of-the-art performance on the target task without altering the model's pre-trained weights. Our approach centers on three key innovations. First, we develop an observer framework that allows a smaller model to replicate and learn the state dynamics of the RWKV-7 model. Second, we employ a kernel method to dynamically upscale the state size, enhancing the model's capacity to capture intricate patterns. Third, we integrate Decorrelated Backpropagation (DBP) to optimize the upscaled state matrix, thereby improving convergence and expressivity. By tuning only the state matrix, we demonstrate that a smaller model can outperform larger models on the given task. This method preserves the efficiency of the original RWKV-7 architecture while harnessing the power of test-time scaling to deliver superior results. Our findings underscore the potential of state tuning as an effective strategy for advancing model performance in resource-constrained settings. Our code is available at https://github.com/TorchRWKV/flash-linear-attention.
中文: 本文提出状态调优这一新型测试时扩展方法,针对RWKV-7模型通过动态扩展和优化状态矩阵(无需修改预训练权重)实现性能突破,其三大核心创新——观察者框架、核函数状态扩展与去相关反向传播——使小模型在特定任务上超越大模型表现。
English: This paper introduces state tuning, a novel test-time scaling method for the RWKV-7 model that enhances performance by dynamically upscaling and optimizing the state matrix without altering pre-trained weights, achieving state-of-the-art results through three key innovations: an observer framework, kernel-based state upscaling, and Decorrelated Backpropagation.

Authors:Ran Xu, Wenqi Shi, Yuchen Zhuang, Yue Yu, Joyce C. Ho, Haoyu Wang, Carl Yang
Title: Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration
Abstract:
Retrieval-Augmented Generation (RAG) systems often struggle to handle multi-hop question-answering tasks accurately due to irrelevant context retrieval and limited complex reasoning capabilities. We introduce Collab-RAG, a collaborative training framework that leverages mutual enhancement between a white-box small language model (SLM) and a black-box large language model (LLM) for RAG. Specifically, the SLM decomposes complex queries into simpler sub-questions, thus enhancing the accuracy of the retrieval and facilitating more effective reasoning by the black-box LLM. Concurrently, the black-box LLM provides feedback signals to improve the SLM's decomposition capability. We observe that Collab-RAG relies solely on supervision from an affordable black-box LLM without additional distillation from frontier LLMs, yet demonstrates strong generalization across multiple black-box LLMs. Experimental evaluations across five multi-hop QA datasets demonstrate that Collab-RAG substantially outperforms existing black-box-only and SLM fine-tuning baselines by 1.8%-14.2% on average. In particular, our fine-tuned 3B SLM surpasses a frozen 32B LLM in question decomposition, highlighting the efficiency of Collab-RAG in improving reasoning and retrieval for complex questions. The code of Collab-RAG is available at https://github.com/ritaranx/Collab-RAG/.
中文:Collab-RAG通过让小型语言模型分解复杂问题、大型语言模型提供反馈,有效提升了多跳问答性能,无需昂贵蒸馏即可实现卓越表现。
English: Collab-RAG enhances multi-hop question-answering by enabling a small language model to decompose complex queries and a large language model to provide feedback, achieving superior performance without expensive distillation.

Authors:Ruikang Liu, Yuxuan Sun, Manyi Zhang, Haoli Bai, Xianzhi Yu, Tiezheng Yu, Chun Yuan, Lu Hou
Title: Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models
Abstract:
Recent advancements in reasoning language models have demonstrated remarkable performance in complex tasks, but their extended chain-of-thought reasoning process increases inference overhead. While quantization has been widely adopted to reduce the inference cost of large language models, its impact on reasoning models remains understudied. In this paper, we conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families ranging from 1.5B to 70B parameters, QwQ-32B, and Qwen3-8B. Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths, with extensive evaluation across mathematical (AIME, MATH-500), scientific (GPQA), and programming (LiveCodeBench) reasoning benchmarks. Our findings reveal that while lossless quantization can be achieved with W8A8 or W4A16 quantization, lower bit-widths introduce significant accuracy risks. We further identify model size, model origin, and task difficulty as critical determinants of performance. Contrary to expectations, quantized models do not exhibit increased output lengths. In addition, strategically scaling the model sizes or reasoning steps can effectively enhance the performance. All quantized models and codes are open-sourced in https://github.com/ruikangliu/Quantized-Reasoning-Models.
中文: 近期推理语言模型的进展面临高推理成本,本研究系统评估了量化的影响,发现W8A8或W4A16量化可实现无损压缩,但更低比特位宽会带来精度风险,模型规模和任务难度是关键影响因素。
English: Recent advances in reasoning language models face high inference costs, and this study systematically evaluates quantization's impact, finding that while W8A8 or W4A16 can achieve lossless compression, lower bit-widths risk accuracy, with model size and task difficulty being key factors.
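For reference, the "W" in schemes like W8A8 or W4A16 refers to weight quantization of this general shape: a minimal round-to-nearest, symmetric, per-output-channel baseline, not any of the state-of-the-art algorithms the paper evaluates.

```python
import torch

def quantize_weights(w: torch.Tensor, bits: int = 8):
    """Symmetric per-output-channel round-to-nearest weight quantization.

    w: (out_features, in_features) weight matrix.
    Returns integer codes (stored in int8 even for 4-bit) and per-channel
    scales; dequantize with q.float() * scale.
    """
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    scale = scale.clamp(min=1e-8)              # guard all-zero channels
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale

# w_hat = q.float() * scale; the W8A8/W4A16 regimes the paper finds
# lossless keep this reconstruction error small enough not to hurt
# reasoning accuracy, while lower bit-widths do not.
```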

Authors:Yubo Li, Xiaobin Shen, Xinyu Yao, Xueying Ding, Yidi Miao, Ramayya Krishnan, Rema Padman
Title: Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
Abstract:
Recent advancements in large language models (LLMs) have revolutionized their ability to handle single-turn tasks, yet real-world applications demand sophisticated multi-turn interactions. This survey provides a comprehensive review of recent advancements in evaluating and enhancing multi-turn interactions in LLMs. Focusing on task-specific scenarios, from instruction following in diverse domains such as math and coding to complex conversational engagements in roleplay, healthcare, education, and even adversarial jailbreak settings, we systematically examine the challenges of maintaining context, coherence, fairness, and responsiveness over prolonged dialogues. The paper organizes current benchmarks and datasets into coherent categories that reflect the evolving landscape of multi-turn dialogue evaluation. In addition, we review a range of enhancement methodologies under multi-turn settings, including model-centric strategies (contextual learning, supervised fine-tuning, reinforcement learning, and new architectures), external integration approaches (memory-augmented, retrieval-based methods, and knowledge graph), and agent-based techniques for collaborative interactions. Finally, we discuss open challenges and propose future directions for research to further advance the robustness and effectiveness of multi-turn interactions in LLMs. Related resources and papers are available at https://github.com/yubol-cmu/Awesome-Multi-Turn-LLMs.
中文摘要:本综述系统回顾了大语言模型多轮交互评估与增强的最新进展,涵盖多领域任务场景下的基准测试、优化方法及未来研究方向。
English Summary: This survey comprehensively reviews recent progress in evaluating and improving multi-turn interactions in large language models, covering benchmarks, enhancement methods, and future research directions across various task-specific scenarios.

Authors:Will Cai, Tianneng Shi, Xuandong Zhao, Dawn Song
Title: Are You Getting What You Pay For? Auditing Model Substitution in LLM APIs
Abstract:
The proliferation of Large Language Models (LLMs) accessed via black-box APIs introduces a significant trust challenge: users pay for services based on advertised model capabilities (e.g., size, performance), but providers may covertly substitute the specified model with a cheaper, lower-quality alternative to reduce operational costs. This lack of transparency undermines fairness, erodes trust, and complicates reliable benchmarking. Detecting such substitutions is difficult due to the black-box nature, typically limiting interaction to input-output queries. This paper formalizes the problem of model substitution detection in LLM APIs. We systematically evaluate existing verification techniques, including output-based statistical tests, benchmark evaluations, and log probability analysis, under various realistic attack scenarios like model quantization, randomized substitution, and benchmark evasion. Our findings reveal the limitations of methods relying solely on text outputs, especially against subtle or adaptive attacks. While log probability analysis offers stronger guarantees when available, its accessibility is often limited. We conclude by discussing the potential of hardware-based solutions like Trusted Execution Environments (TEEs) as a pathway towards provable model integrity, highlighting the trade-offs between security, performance, and provider adoption. Code is available at https://github.com/sunblaze-ucb/llm-api-audit
中文摘要:商业大语言模型API存在信任问题,服务商可能暗中替换为更廉价的模型;仅基于文本输出的检测方法在隐蔽或自适应攻击下并不可靠,对数概率分析虽更有力却常常不可获取,而可信执行环境等硬件方案被讨论为实现可证明模型完整性的潜在途径。
English Summary: Commercial LLM APIs face a trust problem in which providers may covertly substitute cheaper models; detection methods relying only on text outputs prove unreliable against subtle or adaptive attacks, log probability analysis is stronger but often inaccessible, and hardware solutions such as Trusted Execution Environments are discussed as a pathway toward provable model integrity.
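One of the output-based statistical tests the paper finds limited can be sketched as a two-proportion z-test on probe-set accuracy: if the served model's accuracy differs significantly from the reference model's, substitution is suspected. Probe construction and the stronger log-probability tests are beyond this sketch.

```python
import math

def substitution_z_test(ref_correct, ref_total, obs_correct, obs_total):
    """Two-proportion z-test comparing reference vs. observed API accuracy."""
    p1, p2 = ref_correct / ref_total, obs_correct / obs_total
    p = (ref_correct + obs_correct) / (ref_total + obs_total)
    se = math.sqrt(p * (1 - p) * (1 / ref_total + 1 / obs_total))
    if se == 0:
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value from the normal tail.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value
```

The paper's point is precisely that such text-only tests are weak against subtle substitutions (e.g. a quantized variant of the same model) and adaptive providers that answer benchmark-looking queries with the advertised model.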

Authors:Manlai Liang, JiaMing Zhang, Xiong Li, Jinlong Li
Title: LagKV: Lag-Relative Information of the KV Cache Tells Which Tokens Are Important
Abstract:
The increasing size of the Key-Value (KV) cache during long-context inference with Large Language Models is the main obstacle to balancing deployment cost and task accuracy. To reduce the KV cache size in such scenarios, most previous efforts leveraged the attention weights to evict non-critical cache tokens. But there is a trade-off in those methods: they usually require major modification of the inference infrastructure and significant computation overhead. Based on the fact that Large Language Models are autoregressive, we propose LagKV, a KV compression strategy relying only on straightforward comparisons among the KV themselves. It is a totally attention-free method which offers easy integration into mainstream inference platforms and comparable performance to other, more complicated KV compression methods. Results on the RULER benchmark show that our approach outperforms SnapKV and StreamingLLM at different compression ratios. Especially in the 64-digit passkey retrieval task, our method outperforms the attention-weight-based method $H_2O$ by over $50\%$ at the same compression ratios. Our code is available at https://github.com/AI-Lab-China-Merchants-Bank/LagKV.
中文摘要:LagKV是一种创新的KV缓存压缩方法,通过直接比较KV条目而无需注意力权重,实现了便捷集成和在长上下文大语言模型推理中的卓越性能。
English Summary: LagKV is a novel KV cache compression method that eliminates the need for attention weights by comparing KV entries directly, offering easy integration and superior performance in long-context LLM inference.
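The lag-relative idea can be sketched as follows: each token's key/value vectors are normalized against the statistics of a window of later tokens, and the spread of the normalized vector serves as an attention-free importance score. This is an illustration of the comparison-among-KV principle, not the paper's exact scoring formula.

```python
import torch

def lag_relative_scores(keys, values, lag=128):
    """Attention-free importance scores for KV cache eviction (one head).

    keys, values: (seq_len, dim) tensors.  Tokens in the final `lag`
    positions get score 0 here; in practice the most recent tokens are
    kept unconditionally anyway.
    """
    seq_len = keys.size(0)
    scores = torch.zeros(seq_len)
    for t in range(seq_len - lag):
        ref_k = keys[t + 1 : t + 1 + lag]
        ref_v = values[t + 1 : t + 1 + lag]
        # Min-max normalize the token against its lagged reference window.
        k_norm = (keys[t] - ref_k.min(0).values) / (
            ref_k.max(0).values - ref_k.min(0).values + 1e-6)
        v_norm = (values[t] - ref_v.min(0).values) / (
            ref_v.max(0).values - ref_v.min(0).values + 1e-6)
        # Tokens whose KV stand out from what follows them score higher.
        scores[t] = k_norm.std() + v_norm.std()
    return scores  # keep top-scoring tokens, evict the rest
```

Because the score uses only the cache itself, it drops into any inference stack without touching attention kernels, which is the integration advantage claimed above.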

Authors:Archana Sahu, Plaban Kumar Bhowmick
Title: Directed Graph-alignment Approach for Identification of Gaps in Short Answers
Abstract:
In this paper, we present a method for automatically identifying missing items, known as gaps, in student answers by comparing them against the corresponding model/reference answers. The gaps can be identified at the word, phrase, or sentence level. The identified gaps are useful in providing feedback to students for formative assessment. The problem of gap identification has been modelled as an alignment of a pair of directed graphs representing a student answer and the corresponding model answer for a given question. To validate the proposed approach, gap-annotated student answers have been developed for answers from three widely known datasets in the short answer grading domain, namely University of North Texas (UNT), SciEntsBank, and Beetle; this gap-annotated dataset is available at: https://github.com/sahuarchana7/gaps-answers-dataset. Evaluation metrics used in traditional machine learning tasks have been adopted to evaluate the task of gap identification. Though performance of the proposed approach varies across the datasets and the types of answers, overall the performance is observed to be promising.
本文提出了一种通过将学生答案与标准答案进行有向图对齐来自动检测其中缺失内容的方法,该方法在多个数据集上表现出良好的性能,适用于形成性评价的反馈环节。
This paper introduces an automated method for detecting gaps in student answers by aligning them with model answers using directed graphs, which proves effective for formative feedback across multiple datasets.

Authors:Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang
Title: CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization
Abstract:
Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap underscores the need for a deeper understanding of their potential in tackling structured, constraint-intensive problems -- a pursuit currently limited by the absence of comprehensive benchmarks for systematic investigation. To address this, we introduce CO-Bench, a benchmark suite featuring 36 real-world CO problems drawn from a broad range of domains and complexity levels. CO-Bench includes structured problem formulations and curated data to support rigorous investigation of LLM agents. We evaluate multiple agentic frameworks against established human-designed algorithms, revealing the strengths and limitations of existing LLM agents and identifying promising directions for future research. CO-Bench is publicly available at https://github.com/sunnweiwei/CO-Bench.
Chinese: 我们推出了CO-Bench这一包含36个现实组合优化问题的基准测试套件,通过系统评估LLM智能体的表现,揭示了其相对于传统算法的优势与不足,为未来研究指明方向。
English: CO-Bench is introduced as a comprehensive benchmark suite with 36 real-world combinatorial optimization problems to systematically evaluate LLM agents' capabilities, revealing their strengths and limitations compared to traditional algorithms.

Authors:Yuantao Zhang, Zhankui Yang
Title: A Perplexity and Menger Curvature-Based Approach for Similarity Evaluation of Large Language Models
Abstract:
The rise of Large Language Models (LLMs) has brought about concerns regarding copyright infringement and unethical practices in data and model usage. For instance, slight modifications to existing LLMs may be used to falsely claim the development of new models, leading to issues of model copying and violations of ownership rights. This paper addresses these challenges by introducing a novel metric for quantifying LLM similarity, which leverages perplexity curves and differences in Menger curvature. Comprehensive experiments validate the performance of our methodology, demonstrating its superiority over baseline methods and its ability to generalize across diverse models and domains. Furthermore, we highlight the capability of our approach in detecting model replication through simulations, emphasizing its potential to preserve the originality and integrity of LLMs. Code is available at https://github.com/zyttt-coder/LLM_similarity.
中文摘要:本文提出了一种利用困惑度曲线和Menger曲率的新型指标来量化大语言模型相似度,通过实验验证能有效检测模型复制行为,维护模型原创性与完整性。
English Summary: This paper introduces a novel metric using perplexity curves and Menger curvature to quantify LLM similarity, effectively detecting model replication and preserving originality through validated experiments.
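The geometric ingredient is standard: the Menger curvature of three points is four times the triangle area divided by the product of the side lengths. Applied to consecutive points on two models' perplexity curves, curvature differences give a shape-sensitive similarity signal. A minimal implementation:

```python
import math

def menger_curvature(p1, p2, p3):
    """Menger curvature of three 2D points: 4 * area / (|ab| |bc| |ca|)."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    cross = abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))  # = 2 * area
    ab, bc, ca = math.dist(p1, p2), math.dist(p2, p3), math.dist(p3, p1)
    denom = ab * bc * ca
    return 2.0 * cross / denom if denom else 0.0  # 4 * area = 2 * cross
```

Collinear points (zero area) get curvature 0; how the paper samples points along the perplexity curves and aggregates curvature differences into a single similarity score is not reproduced here.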

Authors:Aviv Brokman, Xuguang Ai, Yuhang Jiang, Shashank Gupta, Ramakanth Kavuluru
Title: A Benchmark for End-to-End Zero-Shot Biomedical Relation Extraction with LLMs: Experiments with OpenAI Models
Abstract:
Objective: Zero-shot methodology promises to cut down on costs of dataset annotation and domain expertise needed to make use of NLP. Generative large language models trained to align with human goals have achieved high zero-shot performance across a wide variety of tasks. As of yet, it is unclear how well these models perform on biomedical relation extraction (RE). To address this knowledge gap, we explore patterns in the performance of OpenAI LLMs across a diverse sampling of RE tasks. Methods: We use OpenAI GPT-4-turbo and OpenAI's reasoning models o1 and GPT-OSS to conduct end-to-end RE experiments on seven datasets. We use the JSON generation capabilities of GPT models to generate structured output in two ways: (1) by defining an explicit schema describing the structure of relations, and (2) using a setting that infers the structure from the prompt language. Results: Our work is the first to study and compare the performance of the GPT-4, o1 and GPT-OSS for the end-to-end zero-shot biomedical RE task across a broad array of datasets. We found the zero-shot performances to be proximal to that of fine-tuned methods. The limitations of this approach are that it performs poorly on instances containing many relations and errs on the boundaries of textual mentions. Conclusion: LLMs exhibit promising zero-shot capabilities in complex biomedical RE tasks, offering competitive performance with reduced dataset curation costs and NLP modeling needs but with increased perpetual compute costs. Addressing the limitations we identify could further boost reliability. The code, data, and prompts for all our experiments are publicly available for additional benchmarking by the community: https://github.com/bionlproc/ZeroShotRE
中文摘要:大型语言模型在生物医学关系抽取任务中展现出有竞争力的零样本性能,为传统方法提供了一种成本效益高的替代方案,但需要持续的计算资源投入。
English summary: Large language models demonstrate competitive zero-shot performance in biomedical relation extraction tasks, offering a cost-effective alternative to traditional methods while requiring ongoing computational resources.

Authors:Bing Wang, Bingrui Zhao, Ximing Li, Changchun Li, Wanfu Gao, Shengsheng Wang
Title: Collaboration and Controversy Among Experts: Rumor Early Detection by Tuning a Comment Generator
Abstract:
Over the past decade, social media platforms have been key in spreading rumors, leading to significant negative impacts. To counter this, the community has developed various Rumor Detection (RD) algorithms to automatically identify them using user comments as evidence. However, these RD methods often fail in the early stages of rumor propagation when only limited user comments are available, leading the community to focus on a more challenging topic named Rumor Early Detection (RED). Typically, existing RED methods learn from limited semantics in early comments. However, our preliminary experiment reveals that the RED models always perform best when the number of training and test comments is consistent and extensive. This inspires us to address the RED issue by generating more human-like comments to support this hypothesis. To implement this idea, we tune a comment generator by simulating expert collaboration and controversy and propose a new RED framework named CAMERED. Specifically, we integrate a mixture-of-expert structure into a generative language model and present a novel routing network for expert collaboration. Additionally, we synthesize a knowledgeable dataset and design an adversarial learning strategy to align the style of generated comments with real-world comments. We further integrate generated and original comments with a mutual controversy fusion module. Experimental results show that CAMERED outperforms state-of-the-art RED baseline models and generation methods, demonstrating its effectiveness.
中文摘要:本研究提出CAMERED框架,通过模拟专家协作生成拟真用户评论来增强谣言早期检测能力,实验证明其性能优于现有最优模型。
English Summary: The study introduces CAMERED, a novel framework for Rumor Early Detection that enhances detection accuracy by generating realistic user comments through expert collaboration simulation and adversarial learning, outperforming existing methods.

Authors:Yuhao Wang, Heyang Liu, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Title: VocalNet: Speech LLM with Multi-Token Prediction for Faster and High-Quality Generation
Abstract:
Speech large language models (LLMs) have emerged as a prominent research focus in speech processing. We introduce VocalNet-1B and VocalNet-8B, a series of high-performance, low-latency speech LLMs enabled by a scalable and model-agnostic training framework designed for real-time voice interaction. Central to our contribution is the first application of multi-token prediction (MTP) to speech LLMs. This approach represents a paradigm shift from standard next-token prediction (NTP), offering simultaneous improvements in generation speed and quality. Informed by analysis of MTP's effect on speech generation and experimental comparisons, we designed a straightforward and highly effective MTP implementation. Experiments demonstrate that VocalNet performs on par with mainstream Omni LLMs even with limited training data, and significantly surpasses existing open-source speech LLMs. To foster reproducibility and community advancement, all model weights, inference code, training data, and framework implementations have been made publicly available at https://github.com/SJTU-OmniAgent/VocalNet
中文摘要:VocalNet通过可扩展且模型无关的训练框架,首次将多令牌预测应用于语音大语言模型,实现了高性能、低延迟的实时语音交互,在有限训练数据下性能媲美主流模型并显著超越现有开源方案。
English Summary: VocalNet introduces a scalable, model-agnostic training framework using multi-token prediction to create high-performance, low-latency speech LLMs that match mainstream models with limited data while surpassing existing open-source alternatives.
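Multi-token prediction can be pictured as K parallel output heads reading the same hidden state, so each decoding step commits to several upcoming speech tokens at once. This is a minimal sketch of the MTP idea, not VocalNet's specific head architecture or training losses.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Predict the next k tokens in parallel from one hidden state."""

    def __init__(self, hidden_dim: int, vocab_size: int, k: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(k))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) -> (batch, k, vocab_size) logits for
        # positions t+1 .. t+k, emitted in a single forward pass.
        return torch.stack([head(h) for head in self.heads], dim=1)
```

Relative to next-token prediction, each forward pass yields k tokens, which is where the latency reduction for real-time voice interaction comes from.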

Authors:Dahun Kim, AJ Piergiovanni, Ganesh Mallya, Anelia Angelova
Title: VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Abstract:
We introduce VideoComp, a benchmark and learning framework for advancing video-text compositionality understanding, aimed at improving vision-language models (VLMs) in fine-grained temporal alignment. Unlike existing benchmarks focused on static image-text compositionality or isolated single-event videos, our benchmark targets alignment in continuous multi-event videos. Leveraging video-text datasets with temporally localized event captions (e.g. ActivityNet-Captions, YouCook2), we construct two compositional benchmarks, ActivityNet-Comp and YouCook2-Comp. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences. To improve model performance, we propose a hierarchical pairwise preference loss that strengthens alignment with temporally accurate pairs and gradually penalizes increasingly disrupted ones, encouraging fine-grained compositional learning. To mitigate the limited availability of densely annotated video data, we introduce a pretraining strategy that concatenates short video-caption pairs to simulate multi-event sequences. We evaluate video-text foundational models and large multimodal models (LMMs) on our benchmark, identifying both strengths and areas for improvement in compositionality. Overall, our work provides a comprehensive framework for evaluating and enhancing model capabilities in achieving fine-grained, temporally coherent video-text alignment.
中文: 我们推出VideoComp基准和学习框架,旨在通过多事件视频中的时序对齐测试及分层成对偏好损失与预训练策略,提升视觉语言模型在视频文本组合性理解方面的细粒度能力。
English: We introduce VideoComp, a benchmark and framework for enhancing video-text compositionality in vision-language models by testing temporal alignment through challenging disruptions in multi-event videos and improving performance with a hierarchical pairwise loss and pretraining strategy.

Authors:Kaiyuan Hou, Minghui Zhao, Lilin Xu, Yuang Fan, Xiaofan Jiang
Title: TDBench: Benchmarking Vision-Language Models in Understanding Top-Down Images
Abstract:
The rapid emergence of Vision-Language Models (VLMs) has significantly advanced multimodal understanding, enabling applications in scene comprehension and visual reasoning. While these models have been primarily evaluated and developed for front-view image understanding, their capabilities in interpreting top-down images have received limited attention, partly due to the scarcity of diverse top-down datasets and the challenges in collecting such data. In contrast, top-down vision provides explicit spatial overviews and improved contextual understanding of scenes, making it particularly valuable for tasks like autonomous navigation, aerial imaging, and spatial planning. In this work, we address this gap by introducing TDBench, a comprehensive benchmark for VLMs in top-down image understanding. TDBench is constructed from public top-down view datasets and high-quality simulated images, including diverse real-world and synthetic scenarios. TDBench consists of visual question-answer pairs across ten evaluation dimensions of image understanding. Moreover, we conduct four case studies that commonly happen in real-world scenarios but are less explored. By revealing the strengths and limitations of existing VLMs through evaluation results, we hope TDBench will provide insights that motivate future research. Project homepage: https://github.com/Columbia-ICSL/TDBench
Chinese Summary: 本研究提出TDBench基准测试,通过创新的评估框架和案例研究,针对俯视图像评估视觉语言模型,解决其旋转不变性和可靠性等被忽视的问题。
English Summary: The study introduces TDBench, a benchmark for evaluating Vision Language Models on top-down images, addressing their overlooked rotational invariance and reliability issues through a novel evaluation framework and case studies.

Authors:Zhiqiang Wang, Pengbin Feng, Yanbin Lin, Shuzhang Cai, Zongao Bian, Jinghua Yan, Xingquan Zhu
Title: CrowdVLM-R1: Expanding R1 Ability to Vision Language Model for Crowd Counting using Fuzzy Group Relative Policy Reward
Abstract:
We propose Fuzzy Group Relative Policy Reward (FGRPR), a novel framework that integrates Group Relative Policy Optimization (GRPO) with a fuzzy reward function to enhance learning efficiency. Unlike the conventional binary 0/1 accuracy reward, our fuzzy reward model provides nuanced incentives, encouraging more precise outputs. Experimental results demonstrate that GRPO with a standard 0/1 accuracy reward underperforms compared to supervised fine-tuning (SFT). In contrast, FGRPR, applied to Qwen2.5-VL(3B and 7B), surpasses all baseline models, including GPT4o, LLaMA2(90B), and SFT, across five in-domain datasets. On an out-of-domain dataset, FGRPR achieves performance comparable to SFT but excels when target values are larger, as its fuzzy reward function assigns higher rewards to closer approximations. This approach is broadly applicable to tasks where the precision of the answer is critical. Code and data: https://github.com/yeyimilk/CrowdVLM-R1
我们提出了FGRPR框架,将GRPO与模糊奖励函数结合以提高学习效率,通过提供细致激励促进精确输出,在多个数据集上超越了包括GPT4o和SFT在内的基线模型。
We propose FGRPR, a framework combining GRPO with a fuzzy reward function to improve learning efficiency, outperforming baseline models including GPT4o and SFT across datasets by providing nuanced incentives for precise outputs.
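The contrast with a binary accuracy reward is easy to state in code: a fuzzy reward gives partial credit that grows as the predicted count approaches the target. This is a sketch of the idea; the paper's exact reward shaping may differ.

```python
def fuzzy_count_reward(pred: float, target: float, eps: float = 1e-6) -> float:
    """Reward in [0, 1] that increases as pred approaches target."""
    relative_error = abs(pred - target) / (abs(target) + eps)
    return max(0.0, 1.0 - relative_error)

# Under an exact-match 0/1 reward, predicting 95 for a true count of 100
# scores 0; here it scores 0.95, giving GRPO a graded signal to climb.
```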

Authors:Zae Myung Kim, Anand Ramachandran, Farideh Tavazoee, Joo-Kyung Kim, Oleg Rokhlenko, Dongyeop Kang
Title: Align to Structure: Aligning Large Language Models with Structural Information
Abstract:
Generating long, coherent text remains a challenge for large language models (LLMs), as they lack hierarchical planning and structured organization in discourse generation. We introduce Structural Alignment, a novel method that aligns LLMs with human-like discourse structures to enhance long-form text generation. By integrating linguistically grounded discourse frameworks into reinforcement learning, our approach guides models to produce coherent and well-organized outputs. We employ a dense reward scheme within a Proximal Policy Optimization framework, assigning fine-grained, token-level rewards based on the discourse distinctiveness relative to human writing. Two complementary reward models are evaluated: the first improves readability by scoring surface-level textual features to provide explicit structuring, while the second reinforces deeper coherence and rhetorical sophistication by analyzing global discourse patterns through hierarchical discourse motifs, outperforming both standard and RLHF-enhanced models in tasks such as essay generation and long-document summarization. All training data and code will be publicly shared at https://github.com/minnesotanlp/struct_align.
中文:结构对齐方法通过强化学习将大型语言模型与人类话语结构对齐,利用细粒度奖励提升文本连贯性,在文章生成和长文档摘要等任务中优于现有模型。
English: Structural Alignment enhances long-form text generation in LLMs by aligning them with human discourse structures through reinforcement learning, using token-level rewards to improve coherence and outperforming existing models in tasks like essay writing and summarization.

Authors:Peter Baile Chen, Tomer Wolfson, Michael Cafarella, Dan Roth
Title: EnrichIndex: Using LLMs to Enrich Retrieval Indices Offline
Abstract:
Existing information retrieval systems excel in cases where the language of target documents closely matches that of the user query. However, real-world retrieval systems are often required to implicitly reason whether a document is relevant. For example, when retrieving technical texts or tables, their relevance to the user query may be implied through a particular jargon or structure, rather than explicitly expressed in their content. Large language models (LLMs) hold great potential in identifying such implied relevance by leveraging their reasoning skills. Nevertheless, current LLM-augmented retrieval is hindered by high latency and computation cost, as the LLM typically computes the query-document relevance online, for every query anew. To tackle this issue we introduce EnrichIndex, a retrieval approach which instead uses the LLM offline to build semantically-enriched retrieval indices, by performing a single pass over all documents in the retrieval corpus once during ingestion time. Furthermore, the semantically-enriched indices can complement existing online retrieval approaches, boosting the performance of LLM re-rankers. We evaluated EnrichIndex on five retrieval tasks, involving passages and tables, and found that it outperforms strong online LLM-based retrieval systems, with an average improvement of 11.7 points in recall @ 10 and 10.6 points in NDCG @ 10 compared to strong baselines. In terms of online calls to the LLM, it processes 293.3 times fewer tokens which greatly reduces the online latency and cost. Overall, EnrichIndex is an effective way to build better retrieval indices offline by leveraging the strong reasoning skills of LLMs.
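中文: EnrichIndex在文档入库阶段利用大语言模型对语料进行一次离线遍历,构建语义增强的检索索引,在五项段落与表格检索任务上平均recall@10提升11.7分、NDCG@10提升10.6分,同时在线处理的token数量减少293.3倍。
English: EnrichIndex uses an LLM offline, in a single pass over the corpus at ingestion time, to build semantically enriched retrieval indices, improving recall@10 by 11.7 points and NDCG@10 by 10.6 points on average across five passage and table retrieval tasks while processing 293.3 times fewer tokens online.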

Authors:Runnan Fang, Xiaobin Wang, Yuan Liang, Shuofei Qiao, Jialong Wu, Zekun Xi, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Title: SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Abstract:
In the interaction between agents and their environments, agents expand their capabilities by planning and executing actions. However, LLM-based agents face substantial challenges when deployed in novel environments or required to navigate unconventional action spaces. To empower agents to autonomously explore environments, optimize workflows, and enhance their understanding of actions, we propose SynWorld, a framework that allows agents to synthesize possible scenarios with multi-step action invocation within the action space and perform Monte Carlo Tree Search (MCTS) exploration to effectively refine their action knowledge in the current environment. Our experiments demonstrate that SynWorld is an effective and general approach to learning action knowledge in new environments. Code is available at https://github.com/zjunlp/SynWorld.
中文摘要:SynWorld框架通过合成多步动作场景和执行蒙特卡洛树搜索探索,使基于大语言模型的智能体能够自主优化工作流程并增强动作理解,实验证明其在新环境中学习动作知识的有效性。
English Summary: SynWorld is a framework that enables LLM-based agents to autonomously explore novel environments by synthesizing scenarios and using Monte Carlo Tree Search to refine their action knowledge, proving effective in experimental evaluations.

Authors:Shuofei Qiao, Zhisong Qiu, Baochang Ren, Xiaobin Wang, Xiangyuan Ru, Ningyu Zhang, Xiang Chen, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
Title: Agentic Knowledgeable Self-awareness
Abstract:
Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making: the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with knowledgeable self-awareness like humans. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.
中文摘要:提出的KnowSelf范式通过情境感知使基于大语言模型的智能体能够自主调控知识运用,在不同任务中以最少的外部知识实现更优的规划效果。
English Summary: The proposed KnowSelf paradigm enables LLM-based agents to autonomously regulate knowledge usage through situational awareness, achieving superior planning performance with minimal external knowledge across various tasks.

Authors:Khai Le-Duc, Tuyen Tran, Bach Phan Tat, Nguyen Kim Hai Bui, Quan Dang, Hung-Phong Tran, Thanh-Thuy Nguyen, Ly Nguyen, Tuan-Minh Phan, Thi Thu Phuong Tran, Chris Ngo, Nguyen X. Khanh, Thanh Nguyen-Tang
Title: MultiMed-ST: Large-scale Many-to-many Multilingual Medical Speech Translation
Abstract:
Multilingual speech translation (ST) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, Traditional Chinese and Simplified Chinese, together with the models. With 290,000 samples, our dataset is the largest medical machine translation (MT) dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most extensive analysis study in ST research to date, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence (seq2seq) comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: https://github.com/leduckhai/MultiMed-ST.
中文摘要:本研究推出了首个医疗领域大规模多语言语音翻译数据集MultiMed-ST,包含29万条五语言样本,并通过全面对比分析推动跨语言医疗交流的发展。
English Summary: This study introduces MultiMed-ST, the first large-scale multilingual speech translation dataset for the medical domain, featuring 290,000 samples across five languages and comprehensive comparative analyses to advance cross-lingual healthcare communication.

Authors:Kaustubh Shivshankar Shejole, Pushpak Bhattacharyya
Title: StereoDetect: Detecting Stereotypes and Anti-stereotypes the Correct Way Using Social Psychological Underpinnings
Abstract:
Stereotypes are known to have very harmful effects, making their detection critically important. However, current research predominantly focuses on detecting and evaluating stereotypical biases, thereby leaving the study of stereotypes in its early stages. Our study revealed that many works have failed to clearly distinguish between stereotypes and stereotypical biases, which has significantly slowed progress in advancing research in this area. Stereotype and Anti-stereotype detection is a problem that requires social knowledge; hence, it is one of the most difficult areas in Responsible AI. This work investigates this task, where we propose a five-tuple definition and provide precise terminologies disentangling stereotypes, anti-stereotypes, stereotypical bias, and general bias. We provide a conceptual framework grounded in social psychology for reliable detection. We identify key shortcomings in existing benchmarks for this task of stereotype and anti-stereotype detection. To address these gaps, we developed StereoDetect, a well curated, definition-aligned benchmark dataset designed for this task. We show that sub-10B language models and GPT-4o frequently misclassify anti-stereotypes and fail to recognize neutral overgeneralizations. We demonstrate StereoDetect's effectiveness through multiple qualitative and quantitative comparisons with existing benchmarks and models fine-tuned on them. The dataset and code is available at https://github.com/KaustubhShejole/StereoDetect.
中文摘要:本研究提出五元组定义以明确区分刻板印象、反刻板印象与刻板偏见,并基于社会心理学构建了定义对齐的基准数据集StereoDetect,揭示了现有语言模型在该任务上的显著分类缺陷。
English Summary: This study addresses the critical need to distinguish stereotypes from stereotypical biases in AI by proposing a clear five-tuple definition and introducing StereoDetect, a carefully curated benchmark that reveals significant classification failures in current language models.

Authors:Makoto Takamoto, Daniel Oñoro-Rubio, Wiem Ben Rim, Takashi Maruyama, Bhushan Kotnis
Title: Optimal Embedding Guided Negative Sample Generation for Knowledge Graph Link Prediction
Abstract:
Knowledge graph embedding (KGE) models encode the structural information of knowledge graphs to predict new links. Effective training of these models requires distinguishing between positive and negative samples with high precision. Although prior research has shown that improving the quality of negative samples can significantly enhance model accuracy, identifying high-quality negative samples remains a challenging problem. This paper theoretically investigates the condition under which negative samples lead to optimal KG embedding and identifies a sufficient condition for an effective negative sample distribution. Based on this theoretical foundation, we propose \textbf{E}mbedding \textbf{MU}tation (\textsc{EMU}), a novel framework that \emph{generates} negative samples satisfying this condition, in contrast to conventional methods that focus on \emph{identifying} challenging negative samples within the training data. Importantly, the simplicity of \textsc{EMU} ensures seamless integration with existing KGE models and negative sampling methods. To evaluate its efficacy, we conducted comprehensive experiments across multiple datasets. The results consistently demonstrate significant improvements in link prediction performance across various KGE models and negative sampling methods. Notably, \textsc{EMU} enables performance improvements comparable to those achieved by models with embedding dimension five times larger. An implementation of the method and experiments are available at https://github.com/nec-research/EMU-KG.
中文: 本文提出EMU框架,基于理论条件生成知识图谱嵌入模型的最优负样本,显著提升了多种模型和数据集上的链接预测性能。
English: This paper introduces EMU, a framework that generates optimal negative samples for knowledge graph embedding models based on a theoretical condition, significantly improving link prediction performance across various models and datasets.
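The generate-not-identify contrast can be sketched as follows: instead of searching the KG for hard negative tails, mutate the positive tail embedding by mixing it with randomly drawn entity embeddings. The mixing rule here is a placeholder; the paper derives the distribution its negatives must satisfy rather than fixing an interpolation.

```python
import torch

def embedding_mutation_negatives(pos_tail, entity_bank, n_neg=32, alpha=0.5):
    """Generate negative tail embeddings by mutating the positive one.

    pos_tail: (dim,) embedding of the true tail entity.
    entity_bank: (n_entities, dim) table of entity embeddings.
    """
    idx = torch.randint(0, entity_bank.size(0), (n_neg,))
    random_entities = entity_bank[idx]                  # (n_neg, dim)
    # Interpolated negatives stay near the positive, hence stay "hard",
    # without any search over the training data.
    return alpha * pos_tail.unsqueeze(0) + (1 - alpha) * random_entities
```

Scoring these generated negatives with the usual KGE loss (e.g. against a TransE or RotatE scorer) is what lets the method plug into existing models and samplers.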

Authors:Lin yueyu, Liu Xiao
Title: RWKVTTS: Yet another TTS based on RWKV-7
Abstract:
Human-AI interaction thrives on intuitive and efficient interfaces, among which voice stands out as a particularly natural and accessible modality. Recent advancements in transformer-based text-to-speech (TTS) systems, such as Fish-Speech, CosyVoice, and MegaTTS 3, have delivered remarkable improvements in quality and realism, driving a significant evolution in the TTS domain. In this paper, we introduce RWKV-7 \cite{peng2025rwkv}, a cutting-edge RNN-based architecture tailored for TTS applications. Unlike traditional transformer models, RWKV-7 leverages the strengths of recurrent neural networks to achieve greater computational efficiency and scalability, while maintaining high-quality output. Our comprehensive benchmarks demonstrate that RWKV-7 outperforms transformer-based models across multiple key metrics, including synthesis speed, naturalness of speech, and resource efficiency. Furthermore, we explore its adaptability to diverse linguistic contexts and low-resource environments, showcasing its potential to democratize TTS technology. These findings position RWKV-7 as a powerful and innovative alternative, paving the way for more accessible and versatile voice synthesis solutions in real-world applications. Our code and weights are available at https://github.com/yynil/RWKVTTS and https://huggingface.co/spaces/RWKV-Red-Team.
中文: RWKV-7提出了一种基于RNN的创新架构,在文本转语音应用中超越了传统Transformer模型,在效率、速度和语音自然度上表现更优,有效提升了多语言及低资源环境下的技术普及性。
English: RWKV-7 introduces a novel RNN-based architecture for text-to-speech that surpasses transformer models in efficiency, speed, and naturalness, enhancing accessibility across diverse linguistic and low-resource settings.

Authors:Weitao Li, Kaiming Liu, Xiangyu Zhang, Xuanyu Lei, Weizhi Ma, Yang Liu
Title: Efficient Dynamic Clustering-Based Document Compression for Retrieval-Augmented-Generation
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a widely adopted approach for knowledge injection during large language model (LLM) inference in recent years. However, due to their limited ability to exploit fine-grained inter-document relationships, current RAG implementations face challenges in effectively addressing noise and redundant content in the retrieved documents, which may cause errors in the generation results. To address these limitations, we propose an Efficient Dynamic Clustering-based document Compression framework (EDC2-RAG) that utilizes latent inter-document relationships while simultaneously removing irrelevant information and redundant content. We validate our approach, built upon GPT-3.5-Turbo and GPT-4o-mini, on widely used knowledge-QA and Hallucination-Detection datasets. Experimental results show that our method achieves consistent performance improvements across various scenarios and experimental settings, demonstrating strong robustness and applicability. Our code and datasets are available at https://github.com/Tsinghua-dhy/EDC-2-RAG.
中文: 提出的EDC2-RAG框架通过动态文档聚类来优化检索增强生成技术,有效消除冗余噪声,在GPT模型上的多场景测试中均展现出稳定的性能提升。
English: The proposed EDC2-RAG framework enhances Retrieval-Augmented Generation by dynamically clustering documents to reduce noise and redundancy, demonstrating consistent performance improvements across multiple benchmarks with GPT models.
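A minimal sketch of clustering-based compression of retrieved documents: near-duplicate documents (by embedding cosine similarity) collapse into one cluster, and a single representative per cluster is kept before prompting the LLM. The paper's framework is dynamic and relation-aware; this greedy version only illustrates the redundancy-removal step.

```python
import numpy as np

def cluster_compress(doc_embeddings: np.ndarray, docs: list, threshold: float = 0.85):
    """Keep one representative per cluster of near-duplicate documents."""
    normed = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    representatives, rep_vecs = [], []
    for doc, vec in zip(docs, normed):
        if rep_vecs and max(float(vec @ r) for r in rep_vecs) >= threshold:
            continue  # redundant with an already-kept document
        representatives.append(doc)
        rep_vecs.append(vec)
    return representatives
```

Fewer, less redundant context documents both shorten the prompt and remove the conflicting near-duplicates that tend to induce errors in generation.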

Authors:Yuxiang Zheng, Dayuan Fu, Xiangkun Hu, Xiaojie Cai, Lyumanshan Ye, Pengrui Lu, Pengfei Liu
Title: DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
Abstract:
Large Language Models (LLMs) equipped with web search capabilities have demonstrated impressive potential for deep research tasks. However, current approaches predominantly rely on either manually engineered prompts (prompt engineering-based) with brittle performance or reinforcement learning within controlled Retrieval-Augmented Generation (RAG) environments (RAG-based) that fail to capture the complexities of real-world interaction. In this paper, we introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents through scaling reinforcement learning (RL) in real-world environments with authentic web search interactions. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. We implement a specialized multi-agent architecture in which browsing agents extract relevant information from various webpage structures, overcoming significant technical challenges. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines and up to 7.2 points over RAG-based RL agents. Our qualitative analysis reveals emergent cognitive behaviors from end-to-end RL training, including the ability to formulate plans, cross-validate information from multiple sources, engage in self-reflection to redirect research, and maintain honesty when unable to find definitive answers. Our results highlight that end-to-end training in real-world web environments is not merely an implementation detail but a fundamental requirement for developing robust research capabilities aligned with real-world applications. We release DeepResearcher at https://github.com/GAIR-NLP/DeepResearcher.
中文: DeepResearcher是一种通过真实网络环境中的强化学习来训练大语言模型成为深度研究代理的创新框架,其性能显著超越现有方法,并展现出规划验证、自我反思等新兴认知能力。
English: DeepResearcher is a novel framework that trains large language models as deep research agents through reinforcement learning in real-world web environments, significantly outperforming existing methods and demonstrating emergent cognitive behaviors for robust information gathering.

Authors:Weili Cao, Jianyou Wang, Youze Zheng, Longtian Bao, Qirui Zheng, Taylor Berg-Kirkpatrick, Ramamohan Paturi, Leon Bergen
Title: Single-Pass Document Scanning for Question Answering
Abstract:
Handling extremely large documents for question answering is challenging: chunk-based embedding methods often lose track of important global context, while full-context transformers can be prohibitively expensive for hundreds of thousands of tokens. We propose a single-pass document scanning approach that processes the entire text in linear time, preserving global coherence while deciding which sentences are most relevant to the query. On 41 QA benchmarks, our single-pass scanner consistently outperforms chunk-based embedding methods and competes with large language models at a fraction of the computational cost. By conditioning on the entire preceding context without chunk breaks, the method preserves global coherence, which is especially important for long documents. Overall, single-pass document scanning offers a simple solution for question answering over massive text. All code, datasets, and model checkpoints are available at https://github.com/MambaRetriever/MambaRetriever
中文: 提出的单遍文档扫描方法以线性时间处理整个文本,保持全局连贯性,在问答基准测试中优于分块方法,且计算成本仅为大型语言模型的一小部分。
English: The proposed single-pass document scanning method efficiently processes entire documents in linear time, preserving global coherence and outperforming chunk-based approaches on QA benchmarks at a fraction of the computational cost.
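As a toy illustration of the single-pass idea, the sketch below scores each sentence exactly once, in document order, against the query plus a bounded running context. The real scanner is a trained state-space model; the TF-IDF scorer here is only a stand-in:

```python
# Toy single-pass sentence selector (illustrative only; the paper's
# scanner is a trained state-space model, not TF-IDF). Each sentence is
# scored once, in order, against the query plus a small running context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def scan(sentences, query, keep=5, context_size=3):
    vec = TfidfVectorizer().fit(sentences + [query])
    scores, context = [], []
    for sent in sentences:                              # one linear pass
        q = vec.transform([" ".join([query] + context)])
        scores.append(cosine_similarity(vec.transform([sent]), q)[0, 0])
        context = (context + [sent])[-context_size:]    # bounded state
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])[:keep]
    return [sentences[i] for i in sorted(ranked)]       # keep document order
```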

Authors:Zhihan Zhang, Yixin Cao, Lizi Liao
Title: Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement
Abstract:
Translating chart images into executable plotting scripts, referred to as the chart-to-code generation task, requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. However, this task is inherently under-constrained: multiple valid code implementations can produce the same visual chart, and evaluation must consider both code correctness and visual fidelity across diverse dimensions. This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our approach introduces a structured variant generation strategy and a visual reward model to efficiently produce high-quality, aspect-aware preference pairs, making preference collection scalable and supervision more targeted. These preferences are used in an offline reinforcement learning setup to optimize the model toward multi-dimensional fidelity. Experimental results show that our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and even some proprietary systems. The code and datasets are publicly available at https://github.com/Zhihan72/Chart2Code.
中文摘要:本文提出了一种双重偏好引导的优化框架,通过结合视觉与代码奖励的迭代偏好学习,显著提升了多模态大语言模型在图表转代码任务中的性能,使其达到可与专业系统相媲美的水平。
English Summary: This paper introduces a dual preference-guided refinement framework that enhances Multimodal Large Language Models for chart-to-code generation by combining visual and code rewards through iterative preference learning, significantly improving performance to rival specialized systems.

Authors:Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang
Title: How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
Abstract:
Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.
中文摘要:后训练通过调整和扩展知识表征而不改变事实存储位置来增强大语言模型,同时揭示了有助于模型引导与可解释性的差异化真实性表达与拒绝机制。
English Summary: Post-training enhances large language models by adapting and developing knowledge representations without altering factual storage locations, while revealing distinct truthfulness and refusal mechanisms that aid in model steering and interpretability.
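Finding (2) suggests a construction simple enough to sketch: a difference-in-means direction between truthful and untruthful hidden states, applied as an additive intervention. The layer choice, scaling factor, and hidden states below are placeholders, not the paper's exact recipe:

```python
# Sketch of a difference-in-means "truthfulness direction" and an
# additive steering intervention. Layer choice, the alpha scale, and
# the hidden-state arrays are assumed placeholders, not the paper's setup.
import numpy as np

def direction(h_truthful, h_untruthful):
    """h_*: (n, d) arrays of hidden states at a chosen layer."""
    v = h_truthful.mean(axis=0) - h_untruthful.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=4.0):
    """Shift every token's hidden state along the direction."""
    return hidden + alpha * v
```

Transferability in the paper's sense would mean computing `v` on the base model and applying `steer` to the post-trained one.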

Authors:Jinqi Luo, Tianjiao Ding, Kwan Ho Ryan Chan, Hancheng Min, Chris Callison-Burch, René Vidal
Title: Concept Lancet: Image Editing with Compositional Representation Transplant
Abstract:
Diffusion models are widely used for image editing tasks. Existing editing methods often design a representation manipulation procedure by curating an edit direction in the text embedding or score space. However, such a procedure faces a key challenge: overestimating the edit strength harms visual consistency while underestimating it fails the editing task. Notably, each source image may require a different editing strength, and it is costly to search for an appropriate strength via trial-and-error. To address this challenge, we propose Concept Lancet (CoLan), a zero-shot plug-and-play framework for principled representation manipulation in diffusion-based image editing. At inference time, we decompose the source input in the latent (text embedding or diffusion score) space as a sparse linear combination of the representations of the collected visual concepts. This allows us to accurately estimate the presence of concepts in each image, which informs the edit. Based on the editing task (replace/add/remove), we perform a customized concept transplant process to impose the corresponding editing direction. To sufficiently model the concept space, we curate a conceptual representation dataset, CoLan-150K, which contains diverse descriptions and scenarios of visual terms and phrases for the latent dictionary. Experiments on multiple diffusion-based image editing baselines show that methods equipped with CoLan achieve state-of-the-art performance in editing effectiveness and consistency preservation.
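CoLan's central step, decomposing a source embedding as a sparse linear combination over a concept dictionary, can be sketched with an off-the-shelf Lasso solver; the random dictionary below merely stands in for CoLan-150K representations:

```python
# Sketch of the sparse concept decomposition at the heart of CoLan:
# solve min_w ||x - Dw||^2 + lambda * ||w||_1 over a concept dictionary D.
# A random dictionary stands in for CoLan-150K embeddings here.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, k = 768, 200                       # embedding dim, number of concepts
D = rng.normal(size=(d, k))           # columns = concept representations
x = D[:, 3] * 0.8 + D[:, 17] * 0.5    # a synthetic source embedding

w = Lasso(alpha=0.05, fit_intercept=False, max_iter=5000).fit(D, x).coef_
present = np.argsort(-np.abs(w))[:5]  # concepts estimated present in x
# Editing then adds/removes/replaces the corresponding concept atoms,
# scaled by the recovered coefficients w, instead of a fixed edit strength.
```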

Authors:Gaurav Verma, Jiawei Zhou, Mohit Chandra, Srijan Kumar, Munmun De Choudhury
Title: A Framework for Situating Innovations, Opportunities, and Challenges in Advancing Vertical Systems with Large AI Models
Abstract:
Large artificial intelligence (AI) models have garnered significant attention for their remarkable, often "superhuman", performance on standardized benchmarks. However, when these models are deployed in high-stakes verticals such as healthcare, education, and law, they often reveal notable limitations. For instance, they exhibit brittleness to minor variations in input data, present contextually uninformed decisions in critical settings, and undermine user trust by confidently producing or reproducing inaccuracies. These challenges in applying large models necessitate cross-disciplinary innovations to align the models' capabilities with the needs of real-world applications. We introduce a framework that addresses this gap through a layer-wise abstraction of innovations aimed at meeting users' requirements with large models. Through multiple case studies, we illustrate how researchers and practitioners across various fields can operationalize this framework. Beyond modularizing the pipeline of transforming large models into useful "vertical systems", we also highlight the dynamism that exists within different layers of the framework. Finally, we discuss how our framework can guide researchers and practitioners to (i) optimally situate their innovations (e.g., when vertical-specific insights can empower broadly impactful vertical-agnostic innovations), (ii) uncover overlooked opportunities (e.g., spotting recurring problems across verticals to develop practically useful foundation models instead of chasing benchmarks), and (iii) facilitate cross-disciplinary communication of critical challenges (e.g., enabling a shared vocabulary for AI developers, domain experts, and human-computer interaction scholars). Project webpage: https://gaurav22verma.github.io/vertical-systems-with-large-ai-models/

Authors:Leonardo Iurada, Marco Ciccone, Tatiana Tommasi
Title: Efficient Model Editing with Task-Localized Sparse Fine-tuning
Abstract:
Task arithmetic has emerged as a promising approach for editing models by representing task-specific knowledge as composable task vectors. However, existing methods rely on network linearization to derive task vectors, leading to computational bottlenecks during training and inference. Moreover, linearization alone does not ensure weight disentanglement, the key property that enables conflict-free composition of task vectors. To address this, we propose TaLoS, which allows building sparse task vectors with minimal interference without requiring explicit linearization, while sharing information across tasks. We find that pre-trained models contain a subset of parameters with consistently low gradient sensitivity across tasks, and that sparsely updating only these parameters promotes weight disentanglement during fine-tuning. Our experiments prove that TaLoS improves training and inference efficiency while outperforming current methods in task addition and negation. By enabling modular parameter editing, our approach fosters practical deployment of adaptable foundation models in real-world applications.
中文: TaLoS提出了一种构建稀疏任务向量的方法,通过仅更新梯度敏感性低的参数子集来提升训练和推理效率,无需线性化即可在任务算术中超越现有方法。
English: TaLoS introduces a method for creating sparse task vectors that enhance training and inference efficiency by updating only a subset of parameters with low gradient sensitivity, outperforming existing approaches in task arithmetic without requiring linearization.
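The selection rule, as described, can be sketched in PyTorch: accumulate per-parameter gradient sensitivity over a few batches, then mask updates so only the lowest-sensitivity entries change. The keep ratio, loss, and batch loop are placeholders, not the paper's settings:

```python
# Sketch of TaLoS-style parameter selection (as described in the
# abstract): estimate per-parameter gradient sensitivity on a few
# batches, then mask gradients so only low-sensitivity entries update.
import torch

def sensitivity_masks(model, batches, loss_fn, keep_ratio=0.1):
    sens = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                sens[n] += p.grad.detach() ** 2      # accumulated sensitivity
    masks = {}
    for n, s in sens.items():
        k = max(1, int(keep_ratio * s.numel()))
        thresh = s.flatten().kthvalue(k).values      # k-th smallest value
        masks[n] = (s <= thresh).float()             # update only these entries
    return masks

# During fine-tuning, multiply each p.grad by masks[n] before the
# optimizer step, so the resulting task vector is sparse by construction.
```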

Authors:Yan Ma, Steffi Chern, Xuyang Shen, Yiran Zhong, Pengfei Liu
Title: Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme
Abstract:
Reinforcement learning (RL) has recently shown strong potential in improving the reasoning capabilities of large language models and is now being actively extended to vision-language models (VLMs). However, existing RL applications in VLMs often rely on heavily engineered frameworks that hinder reproducibility and accessibility, while lacking standardized evaluation protocols, making it difficult to compare results or interpret training dynamics. This work introduces a transparent, from-scratch framework for RL in VLMs, offering a minimal yet functional four-step pipeline validated across multiple models and datasets. In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors. Extensive experiments on visual reasoning tasks uncover key empirical findings: response length is sensitive to random seeds, reflection correlates with output length, and RL consistently outperforms supervised fine-tuning (SFT) in generalization, even with high-quality data. These findings, together with the proposed framework, aim to establish a reproducible baseline and support broader engagement in RL-based VLM research.
中文摘要:本研究为视觉语言模型提出了一个透明的强化学习框架,并通过实验验证了关键发现,如强化学习在泛化能力上优于监督微调,以及响应长度对性能的影响。
English Summary: This study introduces a transparent reinforcement learning framework for vision-language models, validated through experiments that reveal key insights like RL's superior generalization over supervised fine-tuning and the influence of response length on performance.

Authors:Abhay Kumar, Louis Owen, Nilabhra Roy Chowdhury, Fabian Güra
Title: ZClip: Adaptive Spike Mitigation for LLM Pre-Training
Abstract:
Training large language models (LLMs) presents numerous challenges, including gradient instability and loss spikes. These phenomena can lead to catastrophic divergence, requiring costly checkpoint restoration and data batch skipping. Traditional gradient clipping techniques, such as constant or norm-based methods, fail to address these issues effectively due to their reliance on fixed thresholds or heuristics, leading to inefficient learning and requiring frequent manual intervention. In this work, we propose ZClip, an adaptive gradient clipping algorithm that dynamically adjusts the clipping threshold based on statistical properties of gradient norms over time. Unlike prior reactive strategies, ZClip proactively adapts to training dynamics without making any prior assumptions on the scale and the temporal evolution of gradient norms. At its core, it leverages z-score-based anomaly detection to identify and mitigate large gradient spikes, preventing malignant loss spikes while not interfering with convergence otherwise. Our code is available at: https://github.com/bluorion-com/ZClip.
中文摘要:ZClip是一种自适应梯度裁剪算法,通过基于z分数的异常检测动态调整阈值,有效防止大语言模型训练中的梯度不稳定和损失峰值问题,同时不影响模型收敛。
English Summary: ZClip is an adaptive gradient clipping algorithm that dynamically adjusts thresholds using z-score-based anomaly detection to prevent gradient instability and loss spikes in large language model training without hindering convergence.
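A minimal sketch of the z-score rule, assuming EMA statistics over gradient norms; the warm-up length, decay, and threshold below are illustrative constants, not the paper's values:

```python
# Sketch of z-score-based adaptive clipping in the spirit of ZClip:
# track running mean/variance of the gradient norm and rescale any step
# whose z-score marks it as a spike. All constants are placeholders.
import torch

class ZScoreClipper:
    def __init__(self, decay=0.97, z_max=2.5, warmup=25):
        self.decay, self.z_max, self.warmup = decay, z_max, warmup
        self.norms, self.mean, self.var = [], None, None

    def step(self, model):
        g = torch.nn.utils.clip_grad_norm_(model.parameters(), float("inf"))
        if self.mean is None:                        # warm-up: collect stats
            self.norms.append(float(g))
            if len(self.norms) == self.warmup:
                t = torch.tensor(self.norms)
                self.mean, self.var = t.mean(), t.var()
            return
        z = (g - self.mean) / (self.var.sqrt() + 1e-8)
        if z > self.z_max:                           # spike: rescale in place
            limit = float(self.mean + self.z_max * self.var.sqrt())
            torch.nn.utils.clip_grad_norm_(model.parameters(), limit)
            g = torch.tensor(limit)
        self.mean = self.decay * self.mean + (1 - self.decay) * g
        self.var = self.decay * self.var + (1 - self.decay) * (g - self.mean) ** 2
```

Calling `clipper.step(model)` between `loss.backward()` and `optimizer.step()` leaves typical steps untouched and rescales only statistical outliers.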

Authors:Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan
Title: Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation
Abstract:
Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical temporal dependencies or dilute semantic information. We introduce differential distillation, a principled approach that systematically preserves task-relevant information while suppressing redundancy. Based on this principle, we develop ViLAMP, a hierarchical video-language model that processes hour-long videos at "mixed precision" through two key mechanisms: (1) differential keyframe selection that maximizes query relevance while maintaining temporal distinctiveness at the frame level and (2) differential feature merging that preserves query-salient features in non-keyframes at the patch level. Hence, ViLAMP retains full information in keyframes while reducing non-keyframes to their most salient features, resembling mixed-precision training. Extensive experiments demonstrate ViLAMP's superior performance across four video understanding benchmarks, particularly on long-form content. Notably, ViLAMP can process ultra-long videos (up to 10K frames) on a single NVIDIA A100 GPU, achieving substantial computational efficiency while maintaining state-of-the-art performance. Code and model are available at https://github.com/steven-ccq/ViLAMP.
Chinese: ViLAMP通过差分蒸馏方法,在关键帧中保留完整信息并压缩非关键帧特征,实现了在单个GPU上高效处理超长视频的同时保持顶尖性能。
English: ViLAMP introduces differential distillation to efficiently process long videos by preserving key information in keyframes and compressing non-keyframes, achieving state-of-the-art performance with high computational efficiency on a single GPU.
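Differential keyframe selection can be sketched as a greedy trade-off between query relevance and redundancy with already-chosen frames; the embeddings and the `beta` weight below are placeholders for the model's features:

```python
# Greedy sketch of differential keyframe selection: trade off query
# relevance against similarity to frames already chosen. Frame and
# query embeddings are placeholders for the model's features.
import numpy as np

def select_keyframes(frames, query, k=8, beta=0.5):
    """frames: (n, d) embeddings; query: (d,) embedding."""
    k = min(k, len(frames))
    frames = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    relevance = frames @ query
    chosen = [int(relevance.argmax())]
    while len(chosen) < k:
        redundancy = (frames @ frames[chosen].T).max(axis=1)
        score = relevance - beta * redundancy        # relevant but distinct
        score[chosen] = -np.inf
        chosen.append(int(score.argmax()))
    return sorted(chosen)  # non-keyframes keep only salient patch features
```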

Authors:Xiang Feng, Wentao Jiang, Zengmao Wang, Yong Luo, Pingbo Xu, Baosheng Yu, Hua Jin, Bo Du, Jing Zhang
Title: AnesBench: Multi-Dimensional Evaluation of LLM Reasoning in Anesthesiology
Abstract:
The application of large language models (LLMs) in the medical field has gained significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. In this paper, we systematically evaluate the reasoning capabilities of LLMs in anesthesiology and analyze key factors influencing their performance. To this end, we introduce AnesBench, a cross-lingual benchmark designed to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Through extensive experiments, we first explore how model characteristics, including model scale, Chain of Thought (CoT) length, and language transferability, affect reasoning performance. Then, we further evaluate the effectiveness of different training strategies, leveraging our curated anesthesiology-related dataset, including continuous pre-training (CPT) and supervised fine-tuning (SFT). Additionally, we also investigate how the test-time reasoning techniques, such as Best-of-N sampling and beam search, influence reasoning performance, and assess the impact of reasoning-enhanced model distillation, specifically DeepSeek-R1. We will publicly release AnesBench, along with our CPT and SFT training datasets and evaluation code at https://github.com/MiliLab/AnesBench.
中文: AnesBench是一个跨语言麻醉学推理基准,从事实检索、混合推理到复杂决策三个层次评估大语言模型,并系统分析了模型规模、思维链长度、训练策略与测试时推理技术对性能的影响。
English: AnesBench is a cross-lingual benchmark that assesses anesthesiology-related reasoning in LLMs across three levels, from factual retrieval to complex decision-making, with extensive experiments analyzing how model characteristics, training strategies, and test-time reasoning techniques affect performance.

Authors:Minheng Ni, Ennan Wu, Zidong Gong, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Kevin Lin, Lijuan Wang, Wangmeng Zuo
Title: Measurement of LLM's Philosophies of Human Nature
Abstract:
The widespread application of artificial intelligence (AI) in various tasks, along with frequent reports of conflicts or violations involving AI, has sparked societal concerns about interactions with AI systems. Based on Wrightsman's Philosophies of Human Nature Scale (PHNS), a scale empirically validated over decades to effectively assess individuals' attitudes toward human nature, we design a standardized psychological scale specifically targeting large language models (LLMs), named the Machine-based Philosophies of Human Nature Scale (M-PHNS). By evaluating LLMs' attitudes toward human nature across six dimensions, we reveal that current LLMs exhibit a systemic lack of trust in humans, and there is a significant negative correlation between the model's intelligence level and its trust in humans. Furthermore, we propose a mental loop learning framework, which enables LLMs to continuously optimize their value system during virtual interactions by constructing moral scenarios, thereby improving their attitude toward human nature. Experiments demonstrate that mental loop learning significantly enhances their trust in humans compared to persona or instruction prompts. This finding highlights the potential of human-based psychological assessments for LLMs, which can not only diagnose cognitive biases but also provide a potential solution for ethical learning in artificial intelligence. We release the M-PHNS evaluation code and data at https://github.com/kodenii/M-PHNS.
中文摘要:本研究基于人类本性哲学量表设计了针对大语言模型的M-PHNS评估体系,发现当前大语言模型普遍存在对人类系统性不信任的现象,且模型智能水平与人类信任度呈负相关,同时提出的心理循环学习框架通过道德场景交互有效提升了模型对人类本性的信任态度。
English Summary: This study introduces the Machine-based Philosophies of Human Nature Scale (M-PHNS), revealing that current large language models systematically distrust humans and showing that higher intelligence correlates with lower trust, while proposing a mental loop learning framework that significantly improves their trust through ethical scenario interactions.

Authors:Jeffrey Li, Mohammadreza Armandpour, Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, Fartash Faghri
Title: TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Abstract:
Large Language Models (LLMs) trained on historical web data inevitably become outdated. We investigate evaluation strategies and update methods for LLMs as new data becomes available. We introduce a web-scale dataset for time-continual pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of magnitude larger than previous continual language modeling benchmarks. We also design time-stratified evaluations across both general CC data and specific domains (Wikipedia, StackExchange, and code documentation) to assess how well various continual learning methods adapt to new data while retaining past knowledge. Our findings demonstrate that, on general CC data, autoregressive meta-schedules combined with a fixed-ratio replay of older data can achieve comparable held-out loss to re-training from scratch, while requiring significantly less computation (2.6x). However, the optimal balance between incorporating new data and replaying old data differs as replay is crucial to avoid forgetting on generic web data but less so on specific domains.
中文: 大型语言模型可通过自回归元调度与选择性旧数据回放实现高效更新,在通用网络数据上需依赖回放防止遗忘,而在特定领域则需求较低,能以2.6倍计算效率达到与完全重新训练相当的效能。
English: Large Language Models (LLMs) can be efficiently updated using autoregressive meta-schedules with selective replay of older data, achieving performance comparable to full retraining while reducing computational costs by 2.6 times, though the need for replay varies between generic web data and specialized domains.
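The fixed-ratio replay is easy to sketch as a data mixer; the 30% ratio below is illustrative, not the paper's tuned value:

```python
# Sketch of fixed-ratio replay for time-continual pretraining: each
# batch mixes documents from the newest crawl with a buffer of older
# dumps. The replay ratio here is a placeholder, not the paper's value.
import random

def replay_batches(new_docs, old_docs, batch_size=32, replay_ratio=0.3):
    n_old = int(batch_size * replay_ratio)
    while True:
        batch = random.sample(new_docs, batch_size - n_old)
        batch += random.sample(old_docs, n_old)  # guards against forgetting
        random.shuffle(batch)
        yield batch
```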

Authors:Zhonghang Li, Lianghao Xia, Xubin Ren, Jiabin Tang, Tianyi Chen, Yong Xu, Chao Huang
Title: Urban Computing in the Era of Large Language Models
Abstract:
Urban computing has emerged as a multidisciplinary field that harnesses data-driven technologies to address challenges and improve urban living. Traditional approaches, while beneficial, often face challenges with generalization, scalability, and contextual understanding. The advent of Large Language Models (LLMs) offers transformative potential in this domain. This survey explores the intersection of LLMs and urban computing, emphasizing the impact of LLMs in processing and analyzing urban data, enhancing decision-making, and fostering citizen engagement. We provide a concise overview of the evolution and core technologies of LLMs. Additionally, we survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring, summarizing essential tasks and prior works in various urban contexts, while highlighting LLMs' functional roles and implementation patterns. Building on this, we propose potential LLM-based solutions to address unresolved challenges. To facilitate in-depth research, we compile a list of available datasets and tools applicable to diverse urban scenarios. Finally, we discuss the limitations of current approaches and outline future directions for advancing LLMs in urban computing.
中文: 城市计算利用数据驱动技术应对城市挑战,本综述探讨了大型语言模型(LLMs)如何在交通、环境监测等领域提升数据处理、决策支持和公众参与,同时提出了未来发展方向和相关工具。
English: Urban computing leverages data-driven technologies to tackle urban challenges, and this survey explores how Large Language Models (LLMs) can enhance data processing, decision-making, and citizen engagement across domains like transportation and environmental monitoring, while proposing future directions and tools for advancement.

Authors:Boshi Wang, Huan Sun
Title: Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure
Abstract:
Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the Reversal Curse, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing binding problem in cognitive science, neuroscience and AI. Specifically, we identify two primary causes of the Reversal Curse stemming from transformers' limitations in conceptual binding: the inconsistency and entanglements of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking, and moreover, generalization could be further improved by incorporating special memory layers that support disentangled concept representations. We demonstrate that the skill of reversal unlocks a new kind of memory integration that enables models to solve large-scale arithmetic reasoning problems via parametric forward-chaining, outperforming frontier LLMs based on non-parametric memory and prolonged explicit reasoning.
Chinese: 大语言模型中的逆转诅咒源于Transformer在概念绑定上的局限,而基于JEPA的新型模型设计结合记忆层不仅突破了这一诅咒,还通过参数化前向链实现了卓越的算术推理能力。
English: The Reversal Curse in LLMs arises from transformers' limitations in conceptual binding, and a novel JEPA-based model design with memory layers overcomes this curse, enabling superior arithmetic reasoning through parametric forward-chaining.

Authors:Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R. Bartoldson, Bhavya Kailkhura, Cihang Xie
Title: STAR-1: Safer Alignment of Reasoning LLMs with 1K Data
Abstract:
This paper introduces STAR-1, a high-quality, just-1k-scale safety dataset specifically designed for large reasoning models (LRMs) like DeepSeek-R1. Built on three core principles -- diversity, deliberative reasoning, and rigorous filtering -- STAR-1 aims to address the critical needs for safety alignment in LRMs. Specifically, we begin by integrating existing open-source safety datasets from diverse sources. Then, we curate safety policies to generate policy-grounded deliberative reasoning samples. Lastly, we apply a GPT-4o-based safety scoring system to select training examples aligned with best practices. Experimental results show that fine-tuning LRMs with STAR-1 leads to an average 40% improvement in safety performance across four benchmarks, while only incurring a marginal decrease (e.g., an average of 1.1%) in reasoning ability measured across five reasoning tasks. Extensive ablation studies further validate the importance of our design principles in constructing STAR-1 and analyze its efficacy across both LRMs and traditional LLMs. Our project page is https://ucsc-vlaa.github.io/STAR-1.

Authors:Giulio Starace, Oliver Jaffe, Dane Sherburn, James Aung, Jun Shern Chan, Leon Maksin, Rachel Dias, Evan Mays, Benjamin Kinsella, Wyatt Thompson, Johannes Heidecke, Amelia Glaese, Tejal Patwardhan
Title: PaperBench: Evaluating AI's Ability to Replicate AI Research
Abstract:
We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
Chinese: PaperBench是一个评估AI代理从零开始复现20篇ICML 2024顶尖论文能力的基准测试,通过分层量规和自动评估系统进行客观评测,目前最佳模型仅实现21%的复现完成度,尚未超越人类专家水平。
English: PaperBench is a benchmark that assesses AI agents' ability to replicate 20 high-profile ICML 2024 papers from scratch, using detailed rubrics and an automated LLM judge, with top-performing agents achieving only 21% replication scores and still trailing behind human experts.

Authors:Minhu Park, Hongseok Oh, Eunkyung Choi, Wonseok Hwang
Title: LRAGE: Legal Retrieval Augmented Generation Evaluation Tool
Abstract:
Recently, building retrieval-augmented generation (RAG) systems to enhance the capability of large language models (LLMs) has become a common practice. Especially in the legal domain, previous judicial decisions play a significant role under the doctrine of stare decisis which emphasizes the importance of making decisions based on (retrieved) prior documents. However, the overall performance of a RAG system depends on many components: (1) retrieval corpora, (2) retrieval algorithms, (3) rerankers, (4) LLM backbones, and (5) evaluation metrics. Here we propose LRAGE, an open-source tool for holistic evaluation of RAG systems focusing on the legal domain. LRAGE provides GUI and CLI interfaces to facilitate seamless experiments and investigate how changes in the aforementioned five components affect the overall accuracy. We validated LRAGE using multilingual legal benchmarks including Korean (KBL), English (LegalBench), and Chinese (LawBench) by demonstrating how the overall accuracy changes when varying the five components mentioned above. The source code is available at https://github.com/hoorangyee/LRAGE.
Chinese Summary: LRAGE是一个专注于法律领域的开源工具,用于全面评估检索增强生成系统,通过图形界面和命令行界面帮助用户分析五个关键组件对整体准确性的影响。
English Summary: LRAGE is an open-source tool designed for the holistic evaluation of retrieval-augmented generation systems in the legal domain, enabling users to assess the impact of five key components on overall accuracy through both GUI and CLI interfaces.

Authors:Athena Wen, Tanush Patil, Ansh Saxena, Yicheng Fu, Sean O'Brien, Kevin Zhu
Title: FAIRE: Assessing Racial and Gender Bias in AI-Driven Resume Evaluations
Abstract:
In an era where AI-driven hiring is transforming recruitment practices, concerns about fairness and bias have become increasingly important. To explore these issues, we introduce a benchmark, FAIRE (Fairness Assessment In Resume Evaluation), to test for racial and gender bias in large language models (LLMs) used to evaluate resumes across different industries. We use two methods, direct scoring and ranking, to measure how model performance changes when resumes are slightly altered to reflect different racial or gender identities. Our findings reveal that while every model exhibits some degree of bias, the magnitude and direction vary considerably. This benchmark provides a clear way to examine these differences and offers valuable insights into the fairness of AI-based hiring tools. It highlights the urgent need for strategies to reduce bias in AI-driven recruitment. Our benchmark code and dataset are open-sourced at our repository: https://github.com/athenawen/FAIRE-Fairness-Assessment-In-Resume-Evaluation.git.
中文: FAIRE基准测试揭示了用于简历评估的AI模型存在不同程度的种族和性别偏见,强调了在AI驱动招聘中减少偏见的紧迫性。
English: The FAIRE benchmark reveals varying degrees of racial and gender bias in AI models used for resume evaluation, underscoring the need for bias mitigation in AI-driven hiring.
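The perturbation protocol reduces to a small harness: hold the resume fixed, vary only the name as an identity signal, and compare scores. The `score_resume` callable and the name lists are placeholders, not the released dataset:

```python
# Sketch of FAIRE-style counterfactual testing: keep the resume fixed,
# vary only the candidate name as an identity signal, and compare model
# scores. `score_resume` and the name lists are assumed placeholders.
def bias_gap(resume_template, score_resume, names_a, names_b):
    """Mean score difference when only the candidate name changes."""
    s_a = [score_resume(resume_template.format(name=n)) for n in names_a]
    s_b = [score_resume(resume_template.format(name=n)) for n in names_b]
    return sum(s_a) / len(s_a) - sum(s_b) / len(s_b)
```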

Authors:Lin Zhang, Zhouhong Gu, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao
Title: LITE: LLM-Impelled efficient Taxonomy Evaluation
Abstract:
This paper presents LITE, an LLM-based evaluation method designed for efficient and flexible assessment of taxonomy quality. To address challenges in large-scale taxonomy evaluation, such as efficiency, fairness, and consistency, LITE adopts a top-down hierarchical evaluation strategy, breaking down the taxonomy into manageable substructures and ensuring result reliability through cross-validation and standardized input formats. LITE also introduces a penalty mechanism to handle extreme cases and provides both quantitative performance analysis and qualitative insights by integrating evaluation metrics closely aligned with task objectives. Experimental results show that LITE demonstrates high reliability in complex evaluation tasks, effectively identifying semantic errors, logical contradictions, and structural flaws in taxonomies, while offering directions for improvement. Code is available at https://github.com/Zhang-l-i-n/TAXONOMY_DETECT.
中文: 本文提出LITE,一种基于大语言模型的评估方法,通过分层策略、交叉验证和惩罚机制高效评估分类体系质量,在识别语义错误和结构缺陷方面展现出高可靠性。
English: This paper introduces LITE, an LLM-based evaluation method that efficiently assesses taxonomy quality through a hierarchical strategy, cross-validation, and penalty mechanisms, demonstrating high reliability in identifying errors and structural flaws.

Authors:Bairu Hou, Yang Zhang, Jiabao Ji, Yujian Liu, Kaizhi Qian, Jacob Andreas, Shiyu Chang
Title: ThinkPrune: Pruning Long Chain-of-Thought of LLMs via Reinforcement Learning
Abstract:
We present ThinkPrune, a simple yet effective method for pruning the thinking length of long-thinking LLMs, which have been found to often produce inefficient and redundant thinking processes. Existing preliminary explorations of reducing thinking length primarily focus on forcing the thinking process to early exit, rather than adapting the LLM to optimize and consolidate the thinking process, and therefore the length-performance tradeoff observed so far is sub-optimal. To fill this gap, ThinkPrune offers a simple solution that continuously trains the long-thinking LLMs via reinforcement learning (RL) with an added token limit, beyond which any unfinished thoughts and answers will be discarded, resulting in a zero reward. To further preserve model performance, we introduce an iterative length pruning approach, where multiple rounds of RL are conducted, each with an increasingly more stringent token limit. We observed that ThinkPrune results in a remarkable performance-length tradeoff: on the AIME24 dataset, the reasoning length of DeepSeek-R1-Distill-Qwen-1.5B can be reduced by half with only 2% drop in performance. We also observed that after pruning, the LLMs can bypass unnecessary steps while keeping the core reasoning process complete. Code is available at https://github.com/UCSB-NLP-Chang/ThinkPrune.
中文总结:ThinkPrune通过强化学习对长思维大模型进行迭代式思维剪枝,在AIME24数据集上实现了推理长度减半而性能仅下降2%的显著效果。
English Summary: ThinkPrune is a reinforcement learning-based method that optimizes long-thinking LLMs by iteratively pruning redundant reasoning steps, achieving a 50% reduction in reasoning length with minimal performance loss on the AIME24 dataset.
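The training signal itself is compact enough to state in code: trajectories that exceed the token budget earn zero reward, and successive RL rounds tighten the budget. The schedule below is illustrative:

```python
# Sketch of ThinkPrune's length-limited reward: trajectories exceeding
# the thinking budget are discarded (zero reward); otherwise correctness
# decides. The budget schedule is illustrative, not the paper's.
def reward(tokens, answer, gold, token_limit):
    if len(tokens) > token_limit:   # unfinished thought/answer: discarded
        return 0.0
    return 1.0 if answer == gold else 0.0

# Iterative pruning: successive RL rounds with increasingly strict budgets.
for token_limit in (4096, 3072, 2048):
    pass  # run one RL round using reward(..., token_limit=token_limit)
```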

Authors:Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach
Title: When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
Abstract:
Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at https://github.com/nishadsinghi/sc-genrm-scaling.
Chinese: 在大语言模型测试时计算扩展中,自我一致性方法通过多数投票选择答案,而生成式奖励模型通过验证链评分,研究发现自我一致性在多数实际计算预算下效率更高,且最优推理策略更倾向于大力扩展解决方案生成。
English: Scaling test-time compute for large language models involves a trade-off between generating more solutions through Self-Consistency and using fewer solutions with Generative Reward Model verification, with findings showing Self-Consistency is more compute-efficient for most practical budgets and that optimal inference favors aggressive solution scaling.
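The self-consistency side of the trade-off takes only a few lines: spend the entire budget on sampled solutions and majority-vote the answers. The `generate` callable is a placeholder for one sampled chain-of-thought solution:

```python
# Self-consistency under a fixed budget: spend all compute on sampled
# solutions and majority-vote the final answers. `generate` is a
# placeholder returning one sampled solution's final answer.
from collections import Counter

def self_consistency(problem, generate, budget=16):
    answers = [generate(problem) for _ in range(budget)]
    return Counter(answers).most_common(1)[0][0]

# The GenRM alternative would split the same budget between fewer
# solutions and several verification chains scored per solution.
```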

Authors:Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, Yihan Cao, Hui Ren, Xiang Li, Xiaoxiao Li, Yuyin Zhou
Title: MedReason: Eliciting Factual Medical Reasoning Steps in LLMs via Knowledge Graphs
Abstract:
Medical tasks such as diagnosis and treatment planning require precise and complex reasoning, particularly in life-critical domains. Unlike mathematical reasoning, medical reasoning demands meticulous, verifiable thought processes to ensure reliability and accuracy. However, there is a notable lack of datasets that provide transparent, step-by-step reasoning to validate and enhance the medical reasoning ability of AI models. To bridge this gap, we introduce MedReason, a large-scale high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in large language models (LLMs). We utilize a structured medical knowledge graph (KG) to convert clinical QA pairs into logical chains of reasoning, or "thinking paths", which trace connections from question elements to answers via relevant KG entities. Each path is validated for consistency with clinical logic and evidence-based medicine. Our pipeline generates detailed reasoning for various medical questions from 7 medical datasets, resulting in a dataset of 32,682 question-answer pairs, each with detailed, step-by-step explanations. Experiments demonstrate that fine-tuning with our dataset consistently boosts medical problem-solving capabilities, achieving significant gains of up to 7.7% for DeepSeek-Distill-8B. Our top-performing model, MedReason-8B, outperforms the Huatuo-o1-8B, a state-of-the-art medical reasoning model, by up to 4.2% on the clinical benchmark MedBullets. We also engage medical professionals from diverse specialties to assess our dataset's quality, ensuring MedReason offers accurate and coherent medical reasoning. Our data, models, and code are available at https://github.com/UCSC-VLAA/MedReason.
Chinese: MedReason数据集通过知识图谱为32,682个临床问题生成逐步推理路径,填补了医疗推理数据的空白,经专业验证和实验证明能显著提升AI模型的诊断能力。
English: The MedReason dataset addresses the lack of transparent medical reasoning data by using a knowledge graph to generate step-by-step explanations for 32,682 clinical questions, significantly improving AI models' diagnostic accuracy through fine-tuning.
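The "thinking path" construction can be sketched with networkx: connect question entities to the answer entity through a medical KG and verbalize each hop. The toy triples below are illustrative, not UMLS-scale data:

```python
# Sketch of KG-grounded "thinking path" extraction: connect question
# entities to the answer entity through a medical KG and verbalize
# each hop as a reasoning step. The toy triples are illustrative.
import networkx as nx

G = nx.DiGraph()
G.add_edge("chest pain", "myocardial infarction", relation="symptom_of")
G.add_edge("myocardial infarction", "troponin test", relation="diagnosed_by")

def thinking_path(graph, question_entity, answer_entity):
    nodes = nx.shortest_path(graph, question_entity, answer_entity)
    return [
        f"{u} --{graph[u][v]['relation']}--> {v}"
        for u, v in zip(nodes, nodes[1:])
    ]

print(thinking_path(G, "chest pain", "troponin test"))
```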

Authors:Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme
Title: WikiVideo: Article Generation from Multiple Videos
Abstract:
We present the challenging task of automatically creating a high-level Wikipedia-style article that aggregates information from multiple diverse videos about real-world events, such as natural disasters or political elections. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text and existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles' claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
中文摘要:本文提出了WikiVideo基准,用于从多视频生成维基百科式文章,并开发了协作文章生成(CAG)方法,通过推理模型与视频大模型的交互提升事件语义理解能力,其性能显著优于现有技术。
English Summary: This paper introduces WikiVideo, a benchmark for generating Wikipedia-style articles from multiple videos, and proposes Collaborative Article Generation (CAG), an interactive method that enhances high-level event understanding by combining reasoning models with VideoLLMs, outperforming existing approaches.

Authors:Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, Xin Eric Wang
Title: Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Abstract:
Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face significant challenges: imprecise grounding of GUI elements, difficulties with long-horizon task planning, and performance bottlenecks from relying on single generalist models for diverse cognitive tasks. To this end, we introduce Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a novel Mixture-of-Grounding technique to achieve precise GUI localization and introduce Proactive Hierarchical Planning, dynamically refining action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves 18.9% and 32.7% relative improvements over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing previous best methods by 52.8% on WindowsAgentArena and by 16.52% on AndroidWorld in relative terms. Code available at https://github.com/simular-ai/Agent-S.
Chinese: Agent S2通过组合式框架结合混合定位技术和主动分层规划,解决了图形界面交互中的核心难题,在多项计算机使用基准测试中创下了最优性能记录。
English: Agent S2 introduces a compositional framework with Mixture-of-Grounding and Proactive Hierarchical Planning to overcome GUI interaction challenges, achieving state-of-the-art performance across multiple computer use benchmarks.

Authors:Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou
Title: GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
Abstract:
Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training data from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at https://ryanliu112.github.io/GenPRM.
中文: GenPRM提出了一种生成式过程奖励模型,通过结合代码验证的显式思维链推理来解决现有模型的局限性,在多项任务中显著超越先前方法,为大型语言模型的过程监督建立了新范式。
English: GenPRM introduces a generative process reward model that uses explicit Chain-of-Thought reasoning with code verification to overcome limitations in current process reward models, significantly outperforming prior methods and establishing a new paradigm for process supervision in large language models.

Authors:Wei Zhou, Yuyang Gao, Xuanhe Zhou, Guoliang Li
Title: CrackSQL: A Hybrid SQL Dialect Translation System Powered by Large Language Models
Abstract:
Dialect translation plays a key role in enabling seamless interaction across heterogeneous database systems. However, translating SQL queries between different dialects (e.g., from PostgreSQL to MySQL) remains a challenging task due to syntactic discrepancies and subtle semantic variations. Existing approaches including manual rewriting, rule-based systems, and large language model (LLM)-based techniques often involve high maintenance effort (e.g., crafting custom translation rules) or produce unreliable results (e.g., LLM generates non-existent functions), especially when handling complex queries. In this demonstration, we present CrackSQL, the first hybrid SQL dialect translation system that combines rule and LLM-based methods to overcome these limitations. CrackSQL leverages the adaptability of LLMs to minimize manual intervention, while enhancing translation accuracy by segmenting lengthy complex SQL via functionality-based query processing. To further improve robustness, it incorporates a novel cross-dialect syntax embedding model for precise syntax alignment, as well as an adaptive local-to-global translation strategy that effectively resolves interdependent query operations. CrackSQL supports three translation modes and offers multiple deployment and access options including a web console interface, a PyPI package, and a command-line prompt, facilitating adoption across a variety of real-world use cases.
中文: CrackSQL是一种结合规则与大型语言模型的混合式SQL方言翻译系统,通过功能化查询分割和跨方言语法嵌入技术提升翻译准确性,同时支持多种部署模式以适应实际应用场景。
English: CrackSQL is a hybrid SQL dialect translation system that combines rule-based and LLM-based methods to enhance accuracy and reduce manual effort by segmenting complex queries and employing cross-dialect syntax alignment.
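For a feel of the rule-based half of such a hybrid, the open-source sqlglot library transpiles between dialects purely syntactically; this is the kind of baseline CrackSQL augments with LLM-based segmentation and syntax embedding, not CrackSQL itself:

```python
# Rule-based dialect translation with the open-source sqlglot library,
# shown for contrast: a purely syntactic baseline, not the CrackSQL
# pipeline described above.
import sqlglot

pg_sql = "SELECT name, created_at::date FROM users LIMIT 10"
mysql_sql = sqlglot.transpile(pg_sql, read="postgres", write="mysql")[0]
print(mysql_sql)  # the PostgreSQL :: cast becomes CAST(created_at AS DATE)
```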

Authors:Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, Yuyin Zhou
Title: m1: Unleash the Potential of Test-Time Scaling for Medical Reasoning with Large Language Models
Abstract:
Test-time scaling has emerged as a powerful technique for enhancing the reasoning capabilities of large language models. However, its effectiveness in medical reasoning remains uncertain, as the medical domain fundamentally differs from mathematical tasks in terms of knowledge representation and decision-making processes. In this paper, we provide the first comprehensive investigation of test-time scaling for medical reasoning and present m1, a simple yet effective approach that increases a model's medical reasoning capability at inference. Our evaluation across diverse medical tasks demonstrates that test-time scaling consistently enhances medical reasoning, enabling lightweight fine-tuned models under 10B parameters to establish new state-of-the-art performance, while our 32B model rivals previous 70B-scale medical LLMs. However, we identify an optimal reasoning token budget of approximately 4K, beyond which performance may degrade due to overthinking. Budget forcing, which extends test-time computation through iterative prompts, helps models double-check answers but does not necessarily improve the overall medical QA performance and, in some cases, even introduces errors into previously correct responses. Our case-by-case analysis identifies insufficient medical knowledge as a key bottleneck that prevents further performance gains through test-time scaling. We find that increasing data scale, improving data quality, and expanding model capacity consistently enhance medical knowledge grounding, enabling continued performance improvements, particularly on challenging medical benchmarks where smaller models reach saturation. These findings underscore fundamental differences between medical and mathematical reasoning in LLMs, highlighting that enriched medical knowledge, rather than increased reasoning depth alone, is essential for realizing the benefits of test-time scaling.
中文: 测试时扩展显著提升了语言模型的医学推理能力,最佳推理标记预算约为4K,但其效果受限于医学知识不足而非仅推理深度。
English: Test-time scaling significantly enhances medical reasoning in language models, with an optimal token budget of around 4K, but its effectiveness is limited by insufficient medical knowledge rather than reasoning depth alone.

Authors:Lin Zhang, Zhouhong Gu, Xiaoran Shi, Hongwei Feng, Yanghua Xiao
Title: RECKON: Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model
Abstract:
As large language models (LLMs) advance, efficient knowledge evaluation becomes crucial to verifying their capabilities. Traditional methods, relying on benchmarks, face limitations such as high resource costs and information loss. We propose the Large-scale Reference-based Efficient Knowledge Evaluation for Large Language Model (RECKON), which directly uses reference data to evaluate models. RECKON organizes unstructured data into manageable units and generates targeted questions for each cluster, improving evaluation accuracy and efficiency. Experimental results show that RECKON reduces resource consumption by 56.5% compared to traditional methods while achieving over 97% accuracy across various domains, including world knowledge, code, legal, and biomedical datasets. Code is available at https://github.com/MikeGu721/reckon
中文: RECKON是一种基于参考数据的大语言模型高效知识评估方法,通过将非结构化数据组织成可管理单元并生成针对性问题,在降低56.5%资源消耗的同时,在多个领域保持了超过97%的准确率。
English: RECKON is an efficient knowledge evaluation method for large language models that uses reference data to generate targeted questions, reducing resource consumption by 56.5% while maintaining over 97% accuracy across multiple domains.

Authors:Yunsoo Kim, Michal W. S. Ong, Daniel W. Rogalsky, Manuel Rodriguez-Justo, Honghan Wu, Adam P. Levine
Title: IHC-LLMiner: Automated extraction of tumour immunohistochemical profiles from PubMed abstracts using large language models
Abstract:
Immunohistochemistry (IHC) is essential in diagnostic pathology and biomedical research, offering critical insights into protein expression and tumour biology. This study presents an automated pipeline, IHC-LLMiner, for extracting IHC-tumour profiles from PubMed abstracts, leveraging advanced biomedical text mining. There are two subtasks: abstract classification (include/exclude as relevant) and IHC-tumour profile extraction on relevant included abstracts. The best-performing model, "Gemma-2 finetuned", achieved 91.5% accuracy and an F1 score of 91.4, outperforming GPT-4o by 9.5% in accuracy with 5.9 times faster inference. From an initial dataset of 107,759 abstracts identified for 50 immunohistochemical markers, the classification task identified 30,481 relevant abstracts (Include) using the Gemma-2 finetuned model. For IHC-tumour profile extraction, the Gemma-2 finetuned model achieved the best performance with 63.3% Correct outputs. Extracted IHC-tumour profiles (tumour types and markers) were normalised to Unified Medical Language System (UMLS) concepts to ensure consistency and facilitate IHC-tumour profile landscape analysis. The extracted IHC-tumour profiles demonstrated excellent concordance with available online summary data and provided considerable added value in terms of both missing IHC-tumour profiles and quantitative assessments. Our proposed LLM-based pipeline provides a practical solution for large-scale IHC-tumour profile data mining, enhancing the accessibility and utility of such data for research and clinical applications as well as enabling the generation of quantitative and structured data to support cancer-specific knowledge base development. Models and training datasets are available at https://github.com/knowlab/IHC-LLMiner.
中文: 本研究开发了IHC-LLMiner自动化流程,通过微调Gemma-2模型从PubMed摘要中高效提取并标准化免疫组化-肿瘤特征图谱,实现了高精度的大规模生物医学数据挖掘,为癌症研究提供有力支持。
English: This study introduces IHC-LLMiner, an automated pipeline using fine-tuned Gemma-2 models to efficiently extract and normalize IHC-tumor profiles from PubMed abstracts, achieving high accuracy and enabling large-scale biomedical data mining for cancer research.

Authors:Xiaoxuan Zhu, Zhouhong Gu, Baiqian Wu, Suhang Zheng, Tao Wang, Tianyu Li, Hongwei Feng, Yanghua Xiao
Title: ToReMi: Topic-Aware Data Reweighting for Dynamic Pre-Training Data Selection
Abstract:
Pre-training large language models (LLMs) necessitates enormous diverse textual corpora, making effective data selection a key challenge for balancing computational resources and model performance. Current methodologies primarily emphasize data quality metrics and mixing proportions, yet they fail to adequately capture the underlying semantic connections between training samples and quality disparities within individual domains. We introduce ToReMi (Topic-based Reweighting for Model improvement), a novel two-stage framework that dynamically adjusts training sample weights according to their topical associations and observed learning patterns. Our comprehensive experiments reveal that ToReMi variants consistently achieve superior performance over conventional pre-training approaches, demonstrating accelerated perplexity reduction across multiple domains and enhanced capabilities on downstream evaluation tasks. Code is available at https://github.com/zxx000728/ToReMi.
中文: ToReMi框架通过主题关联和学习模式动态调整训练数据权重,在困惑度降低和下游任务表现上均优于传统方法。
English: The ToReMi framework dynamically adjusts training data weights based on topic associations and learning patterns, consistently outperforming conventional methods in both perplexity reduction and downstream task performance.
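The reweighting loop can be sketched as: group samples by topic, then scale each topic's sampling weight by its recent loss trend. The exponential update and learning rate below are placeholders, not ToReMi's schedule:

```python
# Sketch of topic-based reweighting: raise sampling weight for topics
# whose loss is falling slowly, damp topics already learned. The update
# rule and constants are assumed placeholders, not ToReMi's schedule.
import numpy as np

def update_topic_weights(weights, topic_losses, prev_losses, lr=0.1):
    """weights, *_losses: dicts keyed by topic id."""
    new = {}
    for t, w in weights.items():
        progress = prev_losses[t] - topic_losses[t]   # recent loss drop
        new[t] = w * np.exp(-lr * progress)           # slow topics gain weight
    z = sum(new.values())
    return {t: w / z for t, w in new.items()}         # renormalize
```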

Authors:Anthony Yazdani, Ihor Stepanov, Douglas Teodoro
Title: GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition
Abstract:
Biomedical named entity recognition (NER) presents unique challenges due to specialized vocabularies, the sheer volume of entities, and the continuous emergence of novel entities. Traditional NER models, constrained by fixed taxonomies and human annotations, struggle to generalize beyond predefined entity types. To address these issues, we introduce GLiNER-BioMed, a domain-adapted suite of Generalist and Lightweight Model for NER (GLiNER) models specifically tailored for biomedicine. In contrast to conventional approaches, GLiNER uses natural language labels to infer arbitrary entity types, enabling zero-shot recognition. Our approach first distills the annotation capabilities of large language models (LLMs) into a smaller, more efficient model, enabling the generation of high-coverage synthetic biomedical NER data. We subsequently train two GLiNER architectures, uni- and bi-encoder, at multiple scales to balance computational efficiency and recognition performance. Experiments on several biomedical datasets demonstrate that GLiNER-BioMed outperforms the state-of-the-art in both zero- and few-shot scenarios, achieving 5.96% improvement in F1-score over the strongest baseline (p-value < 0.001). Ablation studies highlight the effectiveness of our synthetic data generation strategy and emphasize the complementary benefits of synthetic biomedical pre-training combined with fine-tuning on general-domain annotations. All datasets, models, and training pipelines are publicly available at https://github.com/ds4dh/GLiNER-biomed.
中文: GLiNER-BioMed提出了一种针对生物医学领域优化的轻量级命名实体识别模型,通过自然语言标签实现零样本实体识别,并借助合成数据生成策略在多个数据集上以5.96%的F1分数显著超越现有最佳方法。
English: GLiNER-BioMed introduces a domain-adapted, lightweight NER model that uses natural language labels for zero-shot recognition of biomedical entities, outperforming state-of-the-art methods with a 5.96% F1-score improvement through synthetic data generation and efficient model training.
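Because GLiNER infers entity types from natural-language labels, zero-shot usage reduces to passing label strings at inference time. A minimal sketch with the open-source gliner package follows; the checkpoint id is a placeholder, as the released GLiNER-BioMed model names may differ.

```python
from gliner import GLiNER

# Placeholder checkpoint id; see the GLiNER-biomed repository for released models.
model = GLiNER.from_pretrained("Ihor/gliner-biomed-base-v1.0")

text = "Imatinib inhibits BCR-ABL kinase activity in chronic myeloid leukemia."
# Natural-language labels replace a fixed taxonomy, enabling zero-shot NER.
labels = ["drug", "gene or protein", "disease"]

for ent in model.predict_entities(text, labels, threshold=0.5):
    print(f"{ent['text']} -> {ent['label']} ({ent['score']:.2f})")
```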

Authors:Jirui Qi, Raquel Fernández, Arianna Bisazza
Title: On the Consistency of Multilingual Context Utilization in Retrieval-Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) with large language models (LLMs) has demonstrated strong performance in multilingual question-answering (QA) tasks by leveraging relevant passages retrieved from corpora. In multilingual RAG (mRAG), the retrieved passages can be written in languages other than that of the query entered by the user, making it challenging for LLMs to effectively utilize the provided information. Recent research suggests that retrieving passages from multilingual corpora can improve RAG performance, particularly for low-resource languages. However, the extent to which LLMs can leverage different kinds of multilingual contexts to generate accurate answers, *independently from retrieval quality*, remains understudied. In this paper, we conduct an extensive assessment of LLMs' ability to (i) make consistent use of a relevant passage regardless of its language, (ii) respond in the expected language, and (iii) focus on the relevant passage even when multiple `distracting' passages in different languages are provided in the context. Our experiments with four LLMs across three QA datasets covering a total of 48 languages reveal a surprising ability of LLMs to extract the relevant information from out-language passages, but a much weaker ability to formulate a full answer in the correct language. Our analysis, based on both accuracy and feature attribution techniques, further shows that distracting passages negatively impact answer quality regardless of their language. However, distractors in the query language exert a slightly stronger influence. Taken together, our findings deepen the understanding of how LLMs utilize context in mRAG systems, providing directions for future improvements.
中文: 多语言检索增强生成使大语言模型能有效提取不同语言段落中的相关信息,但在正确语言中生成完整答案的能力较弱,尤其当存在多语言干扰段落时影响更显著。
English: Multilingual retrieval-augmented generation enables large language models to effectively extract relevant information from passages in different languages, yet they struggle to consistently produce full answers in the correct language, especially when distracted by irrelevant multilingual passages.

Authors:Owen Cook, Jake Vasilakes, Ian Roberts, Xingyi Song
Title: Efficient Annotator Reliability Assessment with EffiARA
Abstract:
Data annotation is an essential component of the machine learning pipeline; it is also a costly and time-consuming process. With the introduction of transformer-based models, annotation at the document level is increasingly popular; however, there is no standard framework for structuring such tasks. The EffiARA annotation framework is, to our knowledge, the first project to support the whole annotation pipeline, from understanding the resources required for an annotation task to compiling the annotated dataset and gaining insights into the reliability of individual annotators as well as the dataset as a whole. The framework's efficacy is supported by two previous studies: one improving classification performance through annotator-reliability-based soft-label aggregation and sample weighting, and the other increasing the overall agreement among annotators through identifying and replacing an unreliable annotator. This work introduces the EffiARA Python package and its accompanying webtool, which provides an accessible graphical user interface for the system. We open-source the EffiARA Python package at https://github.com/MiniEggz/EffiARA and the webtool is publicly accessible at https://effiara.gate.ac.uk.
中文: EffiARA框架是首个支持完整文档级标注流程的综合解决方案,通过提升标注可靠性和效率得到验证,现已作为开源Python包及易用的网络工具发布。
English: The EffiARA framework is the first comprehensive solution supporting the entire document-level annotation pipeline, enhancing reliability and efficiency, as demonstrated in previous studies, and is now available as an open-source Python package with a user-friendly webtool.

Authors:Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
Title: ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers
Abstract:
Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs exhibit minimal contribution during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers, and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing computational costs related to updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.
中文: 本文提出ShortV方法,无需训练即可通过识别并冻结多模态大语言模型中处理视觉令牌的无效层,在保持性能的同时显著降低计算成本,例如在LLaVA-NeXT-13B上实现50%的FLOPs减少。
English: This paper introduces ShortV, a training-free method that reduces computational costs in Multimodal Large Language Models by identifying and freezing ineffective layers during visual token processing, achieving up to 50% FLOPs reduction while maintaining performance.
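Because ShortV is training-free, its core mechanic can be sketched as a wrapper that runs a block and then restores the pre-layer states of visual tokens in layers flagged as ineffective by the LC metric. The sketch assumes a block that maps hidden states to same-shape hidden states; real MLLM blocks take attention arguments and return tuples.

```python
import torch

def frozen_visual_forward(block, hidden_states: torch.Tensor,
                          visual_mask: torch.Tensor) -> torch.Tensor:
    # Run the block normally, then overwrite visual-token positions with their
    # pre-block states, so only text tokens are updated in this layer.
    out = block(hidden_states)
    out[:, visual_mask] = hidden_states[:, visual_mask]
    return out

# Usage: mark the first n_visual sequence positions as visual tokens.
# mask = torch.zeros(seq_len, dtype=torch.bool); mask[:n_visual] = True
```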

Authors:Jie Ma, Zhitao Gao, Qi Chai, Jun Liu, Pinghui Wang, Jing Tao, Zhou Su
Title: FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Abstract:
Audio-Visual Question Answering (AVQA) is a challenging multimodal reasoning task requiring intelligent systems to answer natural language queries based on paired audio-video inputs accurately. However, existing AVQA approaches often suffer from overfitting to dataset biases, leading to poor robustness. Moreover, current datasets may not effectively diagnose these methods. To address these challenges, we first introduce a novel dataset, FortisAVQA, constructed in two stages: (1) rephrasing questions in the test split of the public MUSIC-AVQA dataset and (2) introducing distribution shifts across questions. The first stage expands the test space with greater diversity, while the second enables a refined robustness evaluation across rare, frequent, and overall question distributions. Second, we introduce a robust Multimodal Audio-Visual Epistemic Network (MAVEN) that leverages a multifaceted cycle collaborative debiasing strategy to mitigate bias learning. Experimental results demonstrate that our architecture achieves state-of-the-art performance on FortisAVQA, with a notable improvement of 7.81%. Extensive ablation studies on both datasets validate the effectiveness of our debiasing components. Additionally, our evaluation reveals the limited robustness of existing multimodal QA methods. We also verify the plug-and-play capability of our strategy by integrating it with various baseline models across both datasets. Our dataset and code are available at https://github.com/reml-group/fortisavqa.
Chinese: 本文提出了FortisAVQA数据集,通过重构问题和引入分布偏移来解决视听问答中的过拟合与鲁棒性问题,并设计了MAVEN去偏网络,在性能上实现了7.81%的显著提升,达到最先进水平。
English: This paper introduces FortisAVQA, a novel dataset designed to address overfitting and robustness issues in Audio-Visual Question Answering by incorporating rephrased questions and distribution shifts, and proposes MAVEN, a debiasing network that achieves state-of-the-art performance with a 7.81% improvement.

Authors:Jiuzhou Han, Wray Buntine, Ehsan Shareghi
Title: VerifiAgent: a Unified Verification Agent in Language Model Reasoning
Abstract:
Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) across all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and costs compared to existing process reward models in the mathematical reasoning domain. Code is available at https://github.com/Jiuzhouh/VerifiAgent.
Chinese: VerifiAgent提出了一种结合元验证与自适应工具选择的统一验证框架,在所有推理任务中超越基线方法,同时提升准确率与效率。
English: VerifiAgent introduces a unified verification framework with meta-verification and adaptive tool selection, outperforming baseline methods across reasoning tasks while improving accuracy and efficiency.

Authors:Joshua Rodriguez, Om Sanan, Guillermo Vizarreta-Luna, Steven A. Conrad
Title: Text Chunking for Document Classification for Urban System Management using Large Language Models
Abstract:
Urban systems are managed using complex textual documentation that needs coding and analysis to set requirements and evaluate built environment performance. This paper contributes to the study of applying large language models (LLMs) to qualitative coding activities to reduce resource requirements while maintaining comparable reliability to humans. Qualitative coding and assessment face challenges like resource limitations and bias, accuracy, and consistency between human evaluators. Here we report the application of LLMs to deductively code 10 case documents on the presence of 17 digital twin characteristics for the management of urban systems. We utilize two prompting methods to compare the semantic processing of LLMs with human coding efforts: whole-text analysis and text-chunk analysis using OpenAI's GPT-4o, GPT-4o-mini, and o1-mini models. We found similar trends of internal variability between methods, and results indicate that LLMs may perform on par with human coders when initialized with specific deductive coding contexts. GPT-4o, o1-mini, and GPT-4o-mini showed significant agreement with human raters when employed using a chunking method. The application of both GPT-4o and GPT-4o-mini as additional raters alongside three manual raters showed statistically significant agreement across all raters, indicating that the analysis of textual documents benefits from LLMs. Our findings reveal nuanced sub-themes of LLM application, suggesting that LLMs follow human memory-based coding processes, where whole-text analysis may introduce multiple meanings. The novel contributions of this paper lie in assessing the performance of OpenAI GPT models and in introducing the chunk-based prompting approach, which addresses context-aggregation biases by preserving localized context.
中文: 本研究证明,当采用适当的提示方法(特别是通过保留局部语境的块分析)时,大型语言模型能够以与人类编码者相当的可靠性完成城市规划文档的定性编码。
English: This study demonstrates that large language models can perform qualitative coding of urban planning documents with reliability comparable to human coders when using appropriate prompting methods, particularly through chunk-based analysis that preserves local context.
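A chunk-based prompting pipeline of the kind described here splits each document into overlapping windows so that every LLM call codes a localized context. The window and overlap sizes below are illustrative assumptions, not the paper's settings.

```python
def chunk_text(text: str, max_words: int = 400, overlap: int = 50) -> list:
    # Overlapping word windows preserve local context at chunk boundaries and
    # avoid the context-aggregation bias of coding the whole document at once.
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words])
            for i in range(0, max(len(words) - overlap, 1), step)]

# Each chunk is then sent to the model with the same deductive coding prompt,
# and chunk-level judgments are aggregated into a document-level code.
```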

Authors:Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He
Title: SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers
Abstract:
This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a dual-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful non-reasoning and reasoning LLMs as foundational models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We make available our benchmark and code at https://github.com/xyzCS/SciReplicate-Bench and project homepage at https://xyzcs.github.io/scireplicate.github.io/.
中文摘要:本研究评估大语言模型根据NLP论文算法描述生成代码的能力,提出了高难度的SciReplicate-Bench基准和双智能体框架,最佳模型执行准确率仅达39%,揭示了算法描述缺失或不一致是成功复现的主要障碍。
English Summary: This study assesses large language models' ability to generate code from algorithm descriptions in NLP papers, introducing the challenging SciReplicate-Bench benchmark and a dual-agent framework that achieved only 39% execution accuracy, revealing significant reproduction barriers.

Authors:Rui Wang, Hongru Wang, Boyang Xue, Jianhui Pang, Shudong Liu, Yi Chen, Jiahao Qiu, Derek Fai Wong, Heng Ji, Kam-Fai Wong
Title: Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models
Abstract:
Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to perform complex reasoning tasks, transitioning from fast and intuitive thinking (System 1) to slow and deep reasoning (System 2). While System 2 reasoning improves task accuracy, it often incurs substantial computational costs due to its slow thinking nature and inefficient or unnecessary reasoning behaviors. In contrast, System 1 reasoning is computationally efficient but leads to suboptimal performance. Consequently, it is critical to balance the trade-off between performance (benefits) and computational costs (budgets), giving rise to the concept of reasoning economy. In this survey, we provide a comprehensive analysis of reasoning economy in both the post-training and test-time inference stages of LLMs, encompassing i) the cause of reasoning inefficiency, ii) behavior analysis of different reasoning patterns, and iii) potential solutions to achieve reasoning economy. By offering actionable insights and highlighting open challenges, we aim to shed light on strategies for improving the reasoning economy of LLMs, thereby serving as a valuable resource for advancing research in this evolving area. We also provide a public repository to continually track developments in this fast-evolving field.
中文: 本综述通过分析快速直觉思维的计算效率与深度推理的准确性之间的权衡,探究大语言模型的推理经济性,涵盖训练和推理阶段中的效率成因、行为模式及潜在解决方案。
English: This survey explores reasoning economy in Large Language Models by analyzing the trade-offs between System 1's computational efficiency and System 2's accuracy, examining causes of inefficiency, behavioral patterns, and potential solutions across training and inference stages.

Authors:Yi Chen, Yuying Ge, Rui Wang, Yixiao Ge, Lu Qiu, Ying Shan, Xihui Liu
Title: Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
Abstract:
Recent advancements in Chain of Thought (CoT) generation have significantly improved the reasoning capabilities of Large Language Models (LLMs), with reinforcement learning (RL) emerging as an effective post-training approach. Multimodal Large Language Models (MLLMs) inherit this reasoning potential but remain underexplored in tasks requiring both perception and logical reasoning. To address this, we introduce SEED-Bench-R1, a benchmark designed to systematically evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions, requiring sophisticated perception and reasoning. SEED-Bench-R1 assesses generalization through a three-level hierarchy: in-distribution, cross-environment, and cross-environment-task scenarios, equipped with a large-scale training dataset with easily verifiable ground-truth answers. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT), demonstrating RL's data efficiency and superior performance on both in-distribution and out-of-distribution tasks, even outperforming SFT on general video understanding benchmarks like LongVideoBench. Our detailed analysis reveals that RL enhances visual perception but often produces less logically coherent reasoning chains. We identify key limitations such as inconsistent reasoning and overlooked visual cues, and suggest future improvements in base model reasoning, reward modeling, and RL robustness against noisy signals.
Chinese: 强化学习的最新进展提升了多模态大语言模型的推理能力,SEED-Bench-R1基准测试验证了其有效性,但在复杂视觉推理任务中保持逻辑连贯性仍是待解决的挑战。
English: Recent advancements in reinforcement learning have enhanced multimodal large language models' reasoning capabilities, as demonstrated by the SEED-Bench-R1 benchmark, though challenges remain in maintaining logical coherence during complex visual reasoning tasks.

Authors:Zhengren Wang, Rui Ling, Chufan Wang, Yongan Yu, Sizhe Wang, Zhiyu Li, Feiyu Xiong, Wentao Zhang
Title: MaintainCoder: Maintainable Code Generation Under Dynamic Requirements
Abstract:
Modern code generation has made significant strides in functional correctness and execution efficiency. However, these systems often overlook a critical dimension in real-world software development: maintainability. To handle dynamic requirements with minimal rework, we propose MaintainCoder as a pioneering solution. It integrates the Waterfall model, design patterns, and multi-agent collaboration to systematically enhance cohesion, reduce coupling, and achieve clear responsibility boundaries and better maintainability. We also introduce MaintainBench, a benchmark comprising requirement changes and novel dynamic metrics on maintenance efforts. Experiments demonstrate that existing code generation methods struggle to meet maintainability standards when requirements evolve. In contrast, MaintainCoder improves dynamic maintainability metrics by more than 60% with even higher correctness of initial codes. Furthermore, while static metrics fail to accurately reflect maintainability and even contradict each other, our proposed dynamic metrics exhibit high consistency. Our work not only provides the foundation for maintainable code generation, but also highlights the need for more realistic and comprehensive code generation research.
中文摘要:MaintainCoder通过融合瀑布模型、设计模式和多智能体协作,提出了一种创新方法,将动态可维护性指标提升超过60%,同时揭示了现有静态指标在反映代码可维护性方面的不足。
English Summary: MaintainCoder introduces a novel approach integrating the Waterfall model, design patterns, and multi-agent collaboration to significantly enhance code maintainability, improving dynamic metrics by over 60% while demonstrating the limitations of existing static metrics.

Authors:Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Wenyue Hua, Haolun Wu, Zhihan Guo, Yufei Wang, Niklas Muennighoff, Irwin King, Xue Liu, Chen Ma
Title: A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?
Abstract:
As enthusiasm for scaling computation (data and parameters) in the pretraining era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in specialized reasoning tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systemic understanding. To fill this gap, we propose a unified, multidimensional framework structured along four core dimensions of TTS research: what to scale, how to scale, where to scale, and how well to scale. Building upon this taxonomy, we conduct an extensive review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique functional roles of individual techniques within the broader TTS landscape. From this analysis, we distill the major developmental trajectories of TTS to date and offer hands-on guidelines for practical deployment. Furthermore, we identify several open challenges and offer insights into promising future directions, including further scaling, clarifying the functional essence of techniques, generalizing to more tasks, and more attributions. Our repository is available at https://github.com/testtimescaling/testtimescaling.github.io/
中文摘要:随着预训练阶段计算扩展的热潮渐退,测试时扩展已成为研究热点,它能显著提升大语言模型在推理和通用任务中的能力,本文为此提出系统性框架并总结实践指南。
English Summary: As interest in scaling computation during pretraining wanes, test-time scaling has become a key research area, enhancing LLMs' capabilities in reasoning and general tasks, prompting this comprehensive survey that introduces a framework and practical guidelines for the field.

Authors:Karim Radouane, Hanane Azzag, Mustapha lebbah
Title: MB-ORES: A Multi-Branch Object Reasoner for Visual Grounding in Remote Sensing
Abstract:
We propose a unified framework that integrates object detection (OD) and visual grounding (VG) for remote sensing (RS) imagery. To support conventional OD and establish an intuitive prior for the VG task, we fine-tune an open-set object detector using referring expression data, framing it as a partially supervised OD task. In the first stage, we construct a graph representation of each image, comprising object queries, class embeddings, and proposal locations. Then, our task-aware architecture processes this graph to perform the VG task. The model consists of: (i) a multi-branch network that integrates spatial, visual, and categorical features to generate task-aware proposals, and (ii) an object reasoning network that assigns probabilities across proposals, followed by a soft selection mechanism for final referring object localization. Our model demonstrates superior performance on the OPT-RSVG and DIOR-RSVG datasets, achieving significant improvements over state-of-the-art methods while retaining classical OD capabilities. The code will be available in our repository: https://github.com/rd20karim/MB-ORES.
中文摘要:本研究提出了一种统一框架,将目标检测与视觉定位相结合应用于遥感图像,通过微调开放集检测器和任务感知架构,在基准数据集上实现了优于现有方法的性能,同时保持了传统检测功能。
English Summary: This study introduces a unified framework that combines object detection and visual grounding for remote sensing imagery, utilizing a fine-tuned open-set detector and a task-aware architecture to achieve state-of-the-art performance on benchmark datasets while preserving traditional detection capabilities.

Authors:Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
Title: TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
Abstract:
The detection of telecom fraud faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) Privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), ensuring real-world consistency through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The generated dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while open-sourcing the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
中文: 本文提出了首个开源音频-文本电信反欺诈数据集TeleAntiFraud-28k,通过隐私保护合成和语义增强策略解决数据稀缺问题,并配套开发了评估基准与优化模型,为多模态反欺诈研究奠定基础。
English: This paper introduces TeleAntiFraud-28k, the first open-source audio-text dataset designed for telecom fraud detection, addressing data scarcity through privacy-preserved synthesis and semantic enhancement strategies, along with a benchmark and fine-tuned model for multimodal anti-fraud research.

Authors:Diana Galvan-Sosa, Gabrielle Gaudeau, Pride Kavumba, Yunmeng Li, Hongyi gu, Zheng Yuan, Keisuke Sakaguchi, Paula Buttery
Title: Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
Abstract:
The performance and usability of Large-Language Models (LLMs) are driving their use in explanation generation tasks. However, despite their widespread adoption, LLM explanations have been found to be unreliable, making it difficult for users to distinguish good from bad explanations. To address this issue, we present Rubrik's CUBE, an education-inspired rubric and a dataset of 26k explanations, written and later quality-annotated using the rubric by both humans and six open- and closed-source LLMs. The CUBE dataset focuses on two reasoning and two language tasks, providing the necessary diversity for us to effectively test our proposed rubric. Using Rubrik, we find that explanations are influenced by both task and perceived difficulty. Low quality stems primarily from a lack of conciseness in LLM-generated explanations, rather than cohesion and word choice. The full dataset, rubric, and code are available at https://github.com/RubriksCube/rubriks_cube.
中文: 大型语言模型虽广泛用于生成解释,但其可靠性存疑,为此我们推出Rubrik's CUBE,这是一个基于教育理念的评分标准和包含2.6万条解释的数据集,用于评估不同任务中解释的质量。
English: Large-Language Models are increasingly used for generating explanations but often produce unreliable results, leading to the development of Rubrik's CUBE, a rubric and dataset to evaluate explanation quality across various tasks.

Authors:Yuqiao Tan, Shizhu He, Huanxuan Liao, Jun Zhao, Kang Liu
Title: Dynamic Parametric Retrieval Augmented Generation for Test-time Knowledge Enhancement
Abstract:
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by retrieving relevant documents from external sources and incorporating them into the context. While it improves reliability by providing factual texts, it significantly increases inference costs as context length grows and introduces the challenging issue of RAG hallucination, primarily caused by the lack of corresponding parametric knowledge in LLMs. An efficient solution is to enhance the knowledge of LLMs at test-time. Parametric RAG (PRAG) addresses this by embedding documents into LLM parameters to perform test-time knowledge enhancement, effectively reducing inference costs through offline training. However, its high training and storage costs, along with limited generalization ability, significantly restrict its practical adoption. To address these challenges, we propose Dynamic Parametric RAG (DyPRAG), a novel framework that leverages a lightweight parameter translator model to efficiently convert documents into parametric knowledge. DyPRAG not only reduces inference, training, and storage costs but also dynamically generates parametric knowledge, seamlessly enhancing the knowledge of LLMs and resolving knowledge conflicts in a plug-and-play manner at test-time. Extensive experiments on multiple datasets demonstrate the effectiveness and generalization capabilities of DyPRAG, offering a powerful and practical RAG paradigm which enables superior knowledge fusion and mitigates RAG hallucination in real-world applications. Our code is available at https://github.com/Trae1ounG/DyPRAG.
中文: 检索增强生成(RAG)通过引入外部文档提升大语言模型性能,但面临高推理成本和幻觉问题;动态参数化RAG(DyPRAG)通过轻量参数转换器将文档高效转化为参数知识,在测试时降低各类成本并增强模型泛化能力。
English: Retrieval-augmented generation (RAG) improves large language models by incorporating external documents but faces issues like high inference costs and hallucinations, which the proposed Dynamic Parametric RAG (DyPRAG) addresses by efficiently converting documents into parametric knowledge to reduce costs and enhance performance at test-time.
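The abstract describes the parameter translator only at a high level, so the sketch below is a speculative reading: a small hypernetwork maps a document embedding to a LoRA-style low-rank weight delta that is added to a frozen linear layer at test time. The class name and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParamTranslator(nn.Module):
    # Maps one document embedding to a rank-r weight delta for a target layer.
    def __init__(self, doc_dim: int, d_model: int, rank: int = 8):
        super().__init__()
        self.to_a = nn.Linear(doc_dim, d_model * rank)
        self.to_b = nn.Linear(doc_dim, rank * d_model)
        self.d_model, self.rank = d_model, rank

    def forward(self, doc_emb: torch.Tensor) -> torch.Tensor:
        # doc_emb: (doc_dim,) embedding of one retrieved document.
        a = self.to_a(doc_emb).view(self.d_model, self.rank)
        b = self.to_b(doc_emb).view(self.rank, self.d_model)
        return a @ b  # low-rank delta W, added onto the frozen weight at test time
```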

Authors:Lu Fan, Jiashu Pu, Rongsheng Zhang, Xiao-Ming Wu
Title: LANID: LLM-assisted New Intent Discovery
Abstract:
Task-oriented Dialogue Systems (TODS) often face the challenge of encountering new intents. New Intent Discovery (NID) is a crucial task that aims to identify these novel intents while maintaining the capability to recognize existing ones. Previous efforts to adapt TODS to new intents have struggled with inadequate semantic representation or have depended on external knowledge, which is often not scalable or flexible. Recently, Large Language Models (LLMs) have demonstrated strong zero-shot capabilities; however, their scale can be impractical for real-world applications that involve extensive queries. To address the limitations of existing NID methods by leveraging LLMs, we propose LANID, a framework that enhances the semantic representation of lightweight NID encoders with the guidance of LLMs. Specifically, LANID employs the K-nearest neighbors and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithms to sample selective utterance pairs from the training set. It then queries an LLM to ascertain the relationships between these pairs. The data produced from this process is utilized to design a contrastive fine-tuning task, which is then used to train a small encoder with a contrastive triplet loss. Our experimental results demonstrate the efficacy of the proposed method across three distinct NID datasets, surpassing strong baselines in both unsupervised and semi-supervised settings. Our code is available at https://github.com/floatSDSDS/LANID.
Chinese: LANID框架利用大型语言模型(LLMs)生成选择性话语对,并通过对比性微调增强轻量级新意图发现(NID)编码器的语义表示,在无监督和半监督设置下,于多个数据集上实现了卓越性能。
English: The LANID framework enhances lightweight New Intent Discovery (NID) encoders by leveraging Large Language Models (LLMs) to generate selective utterance pairs and employing contrastive fine-tuning, achieving superior performance across multiple datasets in both unsupervised and semi-supervised settings.
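The sampling step can be sketched with scikit-learn: K-nearest neighbors propose candidate pairs, and DBSCAN cluster membership splits them into likely same-intent and cross-intent pairs before the LLM adjudicates each one. Hyperparameters here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def sample_pairs_for_llm(embeddings: np.ndarray, k: int = 5, eps: float = 0.5):
    # Pairs within one density cluster are likely same-intent candidates;
    # cross-cluster pairs provide hard candidates for the LLM to adjudicate.
    cluster = DBSCAN(eps=eps, min_samples=5).fit_predict(embeddings)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(embeddings).kneighbors(embeddings)
    same, cross = [], []
    for i, nbrs in enumerate(idx):
        for j in map(int, nbrs[1:]):  # position 0 is the point itself
            bucket = same if cluster[i] == cluster[j] and cluster[i] != -1 else cross
            bucket.append((i, j))
    return same, cross
```

The LLM's verdicts on these pairs then supply positives and negatives for the contrastive triplet loss used to fine-tune the small encoder.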

Authors:Yoonshik Kim, Jaeyoon Jung
Title: KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Abstract:
The recent emergence of Large Vision-Language Models (VLMs) has resulted in a variety of different benchmarks for evaluating such models. Despite this, we observe that most existing evaluation methods suffer from the fact that they either require the model to choose from pre-determined responses, sacrificing open-endedness, or evaluate responses using a judge model, resulting in subjective and unreliable evaluation. In addition, we observe a lack of benchmarks for VLMs in the Korean language, which are necessary as a separate metric from more common English language benchmarks, as the performance of generative language models can differ significantly based on the language being used. Therefore, we present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language for the evaluation of VLMs. Our benchmark consists of 275 carefully crafted questions each paired with an image and grading criteria covering 10 different aspects of VLM performance. The grading criteria eliminate the problem of unreliability by allowing the judge model to grade each response based on a pre-determined set of rules. By defining the evaluation criteria in an objective manner, even a small open-source model can be used to evaluate models on our benchmark reliably. In addition to evaluating a large number of existing VLMs on our benchmark, we also experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods. Our evaluation code is available at https://github.com/maum-ai/KOFFVQA.
中文: 现有视觉语言模型评估方法常限制开放性回答或依赖主观评判,因此我们推出KOFFVQA韩语开放式视觉问答基准,通过预设评分标准实现客观可靠的模型评估。
English: Current VLM evaluations often sacrifice open-endedness or rely on subjective judge models, so we introduce KOFFVQA—a Korean free-form visual question answering benchmark with objective grading criteria to ensure reliable evaluation.

Authors:Zhengren Wang, Jiayang Yu, Dongsheng Ma, Zhe Chen, Yu Wang, Zhiyu Li, Feiyu Xiong, Yanfeng Wang, Weinan E, Linpeng Tang, Wentao Zhang
Title: RARE: Retrieval-Augmented Reasoning Modeling
Abstract:
Domain-specific intelligence demands specialized knowledge and sophisticated reasoning for problem-solving, posing significant challenges for large language models (LLMs) that struggle with knowledge hallucination and inadequate reasoning capabilities under constrained parameter budgets. Inspired by Bloom's Taxonomy in educational theory, we propose Retrieval-Augmented Reasoning Modeling (RARE), a novel paradigm that decouples knowledge storage from reasoning optimization. RARE externalizes domain knowledge to retrievable sources and internalizes domain-specific reasoning patterns during training. Specifically, by injecting retrieved knowledge into training prompts with masked losses, RARE transforms learning objectives from rote memorization to contextualized reasoning. It enables models to bypass parameter-intensive memorization and prioritize the development of higher-order cognitive processes. Extensive experiments demonstrate that lightweight RARE-trained models (e.g., Llama-3.1-8B) could achieve state-of-the-art performance, surpassing retrieval-augmented GPT-4 and DeepSeek-R1 by up to approximately 20% accuracy. RARE establishes a paradigm shift where maintainable external knowledge bases synergize with compact, reasoning-optimized models, collectively driving more scalable domain-specific intelligence.
中文: RARE提出了一种将知识存储与推理优化分离的新范式,通过外部化领域知识和内部化推理模式,使轻量级模型实现顶尖性能,准确率超越大型模型高达20%。
English: RARE introduces a paradigm that separates knowledge storage from reasoning optimization, enabling lightweight models to achieve state-of-the-art performance by externalizing domain knowledge and internalizing reasoning patterns, surpassing larger models by up to 20% accuracy.
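The masked-loss objective can be sketched with the Hugging Face convention that label index -100 is ignored by the cross-entropy loss: the retrieved knowledge prepended to the prompt contributes context but no gradient, so the model is trained to reason over the passage rather than memorize it. The helper below is a sketch, assuming knowledge tokens occupy the first knowledge_len positions.

```python
import torch

def build_rare_labels(input_ids: torch.Tensor, knowledge_len: int) -> torch.Tensor:
    # Copy the targets, then exclude the injected-knowledge span from the loss;
    # -100 is the ignore index used by PyTorch/Hugging Face cross-entropy.
    labels = input_ids.clone()
    labels[:, :knowledge_len] = -100
    return labels
```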

Authors:Reza Esfandiarpoor, George Zerveas, Ruochen Zhang, Macton Mgonzo, Carsten Eickhoff, Stephen H. Bach
Title: Beyond Contrastive Learning: Synthetic Data Enables List-wise Training with Multiple Levels of Relevance
Abstract:
Recent advancements in large language models (LLMs) have allowed the augmentation of information retrieval (IR) pipelines with synthetic data in various ways. Yet, the main training paradigm remains: contrastive learning with binary relevance labels and the InfoNCE loss, where one positive document is compared against one or more negatives. This objective treats all documents that are not explicitly annotated as relevant on an equally negative footing, regardless of their actual degree of relevance, thus (a) missing subtle nuances that are useful for ranking and (b) being susceptible to annotation noise. To overcome this limitation, in this work we forgo real training documents and annotations altogether and use open-source LLMs to directly generate synthetic documents that answer real user queries according to several different levels of relevance. This fully synthetic ranking context of graduated relevance, together with an appropriate list-wise loss (Wasserstein distance), enables us to train dense retrievers in a way that better captures the ranking task. Experiments on various IR datasets show that our proposed approach outperforms conventional training with InfoNCE by a large margin. Without using any real documents for training, our dense retriever significantly outperforms the same retriever trained through self-supervision. More importantly, it matches the performance of the same retriever trained on real, labeled training documents of the same dataset, while being more robust to distribution shift and clearly outperforming it when evaluated zero-shot on the BEIR dataset collection.
中文摘要:本研究提出了一种新方法,通过使用大语言模型生成具有分级相关性的全合成文档,结合列表式Wasserstein损失函数训练密集检索器,该方法显著优于传统对比学习方法,在达到与真实标注数据训练相当性能的同时,对分布偏移表现出更强的鲁棒性。
English Summary: This study introduces a novel approach to training dense retrievers by generating fully synthetic documents with graduated relevance levels using LLMs, which, combined with a list-wise Wasserstein loss, significantly outperforms traditional contrastive learning methods and matches the performance of training on real labeled data while offering greater robustness to distribution shifts.
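With all candidates for a query generated at graduated relevance levels, the list-wise objective can be sketched as a 1-D Wasserstein distance between the model's score distribution and the normalized relevance target, computed via cumulative sums over the candidate list; the paper's exact formulation may differ.

```python
import torch

def listwise_wasserstein(scores: torch.Tensor, relevance: torch.Tensor) -> torch.Tensor:
    # scores: (batch, list_len) raw retriever scores for each synthetic document
    # relevance: (batch, list_len) non-negative graded-relevance targets
    p = torch.softmax(scores, dim=-1)
    q = relevance / relevance.sum(dim=-1, keepdim=True)
    # The 1-D W1 distance equals the summed absolute difference of the CDFs.
    return (p.cumsum(-1) - q.cumsum(-1)).abs().sum(-1).mean()
```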

Authors:Anjiang Wei, Tarun Suresh, Jiannan Cao, Naveen Kannan, Yuheng Wu, Kai Yan, Thiago S. F. X. Teixeira, Ke Wang, Alex Aiken
Title: CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis
Abstract:
Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural language, their ability to perform inductive program synthesis is underexplored. Existing evaluation protocols rely on static sets of examples and held-out tests, offering no feedback when synthesized functions are incorrect and failing to reflect real-world scenarios such as reverse engineering. We propose CodeARC, the Code Abstraction and Reasoning Challenge, a new evaluation framework where agents interact with a hidden target function by querying it with new inputs, synthesizing candidate functions, and iteratively refining their solutions using a differential testing oracle. This interactive setting encourages agents to perform function calls and self-correction based on feedback. We construct the first large-scale benchmark for general-purpose inductive program synthesis, featuring 1114 functions. Among 18 models evaluated, o3-mini performs best with a success rate of 52.7%, highlighting the difficulty of this task. Fine-tuning LLaMA-3.1-8B-Instruct on curated synthesis traces yields up to a 31% relative performance gain. CodeARC provides a more realistic and challenging testbed for evaluating LLM-based program synthesis and inductive reasoning. Our code, data, and models are publicly available at https://github.com/Anjiang-Wei/CodeARC
中文: CodeARC提出了一个交互式评估框架,用于归纳程序合成,智能体通过隐藏目标函数的反馈迭代优化方案,实验表明最佳模型成功率仅达52.7%,而微调可带来显著性能提升。
English: CodeARC introduces an interactive evaluation framework for inductive program synthesis, where agents iteratively refine functions using feedback from a hidden target, with experiments showing top-performing models achieving 52.7% success and fine-tuning yielding significant improvements.
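The interactive loop hinges on a differential testing oracle: the agent's candidate is executed against the hidden target on probe inputs, and any disagreement is returned as a counterexample for the next refinement round. A minimal sketch:

```python
def differential_test(candidate, hidden_fn, probe_inputs):
    # Return the first input on which the synthesized candidate disagrees with
    # the hidden target function, or None if all probes match.
    for x in probe_inputs:
        expected, got = hidden_fn(x), candidate(x)
        if expected != got:
            return {"input": x, "expected": expected, "got": got}
    return None
```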

Authors:Yue Liu, Jiaying Wu, Yufei He, Ruihan Gong, Jun Xia, Liang Li, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, Bryan Hooi, Stan Z. Li, Keqin Li
Title: Efficient Inference for Large Reasoning Models: A Survey
Abstract:
Large Reasoning Models (LRMs) significantly improve the reasoning ability of Large Language Models (LLMs) by learning to reason, exhibiting promising performance in solving complex tasks. However, their deliberative reasoning process leads to inefficiencies in token usage, memory consumption, and inference time. Thus, this survey provides a review of efficient inference methods designed specifically for LRMs, focusing on mitigating token inefficiency while preserving the reasoning quality. First, we introduce a taxonomy to group the recent methods into two main categories: (a) explicit compact Chain-of-Thought (CoT), which reduces tokens while keeping the explicit reasoning structure, and (b) implicit latent CoT, which encodes reasoning steps within hidden representations instead of explicit tokens. Meanwhile, we discuss their strengths and weaknesses. Then, we conduct empirical analyses of existing methods across reasoning scenarios, objective functions, and performance & efficiency aspects. Besides, we present open challenges in this field, including human-centric controllable reasoning, trade-off between interpretability and efficiency of reasoning, ensuring the safety of efficient reasoning, and broader applications of efficient reasoning. In addition, we highlight key insights for enhancing LRMs' inference efficiency via techniques such as model merging, new architectures, and agent routers. We hope this work serves as a valuable guide, helping researchers overcome challenges in this vibrant field. A collection of efficient reasoning methods for LRMs (papers and codes) is provided at this link: https://github.com/yueliu1999/Awesome-Efficient-Inference-for-LRMs.
中文: 本综述针对大型推理模型在推理过程中存在的令牌低效问题,系统评述了保持推理质量的高效推理方法,将其分为显式紧凑思维链和隐式潜在思维链两类,并探讨了其优劣、实证分析及未来挑战。
English: This survey reviews efficient inference methods for Large Reasoning Models (LRMs) to address token inefficiency while maintaining reasoning quality, categorizing approaches into explicit compact Chain-of-Thought and implicit latent CoT while analyzing their trade-offs and future challenges.

Authors:Aske Plaat, Max van Duijn, Niki van Stein, Mike Preuss, Peter van der Putten, Kees Joost Batenburg
Title: Agentic Large Language Models, a survey
Abstract:
There is great interest in agentic LLMs, large language models that act as agents. We review the growing body of work in this area and provide a research agenda. Agentic LLMs are LLMs that (1) reason, (2) act, and (3) interact. We organize the literature according to these three categories. The research in the first category focuses on reasoning, reflection, and retrieval, aiming to improve decision making; the second category focuses on action models, robots, and tools, aiming for agents that act as useful assistants; the third category focuses on multi-agent systems, aiming for collaborative task solving and simulating interaction to study emergent social behavior. We find that works mutually benefit from results in other categories: retrieval enables tool use, reflection improves multi-agent collaboration, and reasoning benefits all categories. We discuss applications of agentic LLMs and provide an agenda for further research. Important applications are in medical diagnosis, logistics and financial market analysis. Meanwhile, self-reflective agents playing roles and interacting with one another augment the process of scientific research itself. Further, agentic LLMs may provide a solution for the problem of LLMs running out of training data: inference-time behavior generates new training states, such that LLMs can keep learning without needing ever larger datasets. We note that there is risk associated with LLM assistants taking action in the real world, while agentic LLMs are also likely to benefit society.
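中文: 本综述将智能体化大语言模型的研究按推理、行动、交互三大类别组织,指出三者相互促进,探讨了医疗诊断、物流和金融市场分析等应用,并提出了后续研究议程。
English: This survey organizes research on agentic LLMs into three categories, reasoning, acting, and interacting, finds that advances in each category reinforce the others, and discusses applications such as medical diagnosis, logistics, and financial market analysis alongside a research agenda.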

Authors:Gabriel Recchia, Chatrik Singh Mangat, Issac Li, Gayatri Krishnakumar
Title: FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research
Abstract:
As AI models tackle increasingly complex problems, ensuring reliable human oversight becomes more challenging due to the difficulty of verifying solutions. Approaches to scaling AI supervision include debate, in which two agents engage in structured dialogue to help a judge evaluate claims; critique, in which models identify potential flaws in proposed solutions; and prover-verifier games, in which a capable 'prover' model generates solutions that must be verifiable by a less capable 'verifier'. Evaluations of the scalability of these and similar approaches to difficult problems benefit from datasets that include (1) long-form expert-verified correct solutions and (2) long-form flawed solutions with annotations highlighting specific errors, but few are available. To address this gap, we present FindTheFlaws, a group of five diverse datasets spanning medicine, mathematics, science, coding, and the Lojban language. Each dataset contains questions and long-form solutions with expert annotations validating their correctness or identifying specific error(s) in the reasoning. We evaluate frontier models' critiquing capabilities and observe a range of performance that can be leveraged for scalable oversight experiments: models performing more poorly on particular datasets can serve as judges/verifiers for more capable models. Additionally, for some task/dataset combinations, expert baselines exceed even top model performance, making them more beneficial for scalable oversight experiments.
中文: 为解决可扩展AI监督中专家标注数据稀缺的问题,FindTheFlaws数据集集合提供了五个涵盖多领域的标注数据集,包含已验证解决方案和错误标注,既能评估模型的批判能力,又可通过让较弱模型验证较强模型来支持可扩展监督实验。
English: To address the scarcity of expert-annotated datasets for scalable AI oversight, the FindTheFlaws collection provides five diverse datasets with validated solutions and error annotations, enabling evaluation of model critiquing capabilities and supporting scalable oversight experiments by pairing weaker models as verifiers for stronger ones.

Authors:Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
Title: Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Abstract:
State Space Models (SSMs) are emerging as a compelling alternative to Transformers because of their consistent memory usage and high performance. Despite this, scaling up SSMs on cloud services or limited-resource devices is challenging due to their storage requirements and computational power. To overcome this, quantizing SSMs with low bit-width data formats can reduce model size and benefit from hardware acceleration. As SSMs are prone to quantization-induced errors, recent efforts have focused on optimizing a particular model or bit-width for efficiency without sacrificing performance. However, distinct bit-width configurations are essential for different scenarios, like W4A8 for boosting large-batch decoding speed, and W4A16 for enhancing generation speed in short prompt applications for a single user. To this end, we present Quamba2, compatible with W8A8, W4A8, and W4A16 for both Mamba1 and Mamba2 backbones, addressing the growing demand for SSM deployment on various platforms. Based on the channel order preserving and activation persistence of SSMs, we propose an offline approach to quantize inputs of a linear recurrence in 8-bit by sorting and clustering the input x, combined with a per-state-group quantization for the input-dependent parameters B and C. To ensure compute-invariance in the SSM output, we rearrange weights offline according to the clustering sequence. The experiments show that Quamba2-8B outperforms two state-of-the-art SSM quantization methods and delivers 1.3x and 3x speed-ups in the pre-filling and generation stages, respectively, while offering 4x memory reduction with only a 1.6% average accuracy drop. The evaluation on MMLU shows the generalizability and robustness of our framework. The code and quantized models will be released at: https://github.com/enyac-group/Quamba.
中文: Quamba2是一种适用于状态空间模型的多位宽量化框架,通过离线排序聚类和权重重排技术,在实现4倍内存压缩和显著加速的同时,仅造成1.6%的精度损失,有效支持不同部署场景的需求。
English: Quamba2 is a versatile quantization framework for State Space Models that supports multiple bit-width configurations to reduce memory usage and accelerate computation while maintaining accuracy, achieving significant speed-ups and memory savings with minimal performance loss.
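The sort-and-cluster idea can be sketched as grouped 8-bit fake quantization: channels are ordered by calibration statistics and split into groups, each with its own scale, so a few outlier channels no longer stretch one global scale. Equal-size slices of the sorted order stand in for the paper's clustering; this is an illustrative simplification, not Quamba2's implementation.

```python
import torch

def grouped_int8_fake_quant(x: torch.Tensor, n_groups: int = 8):
    # x: (tokens, channels). Order channels by max magnitude, split into groups,
    # and fake-quantize each group with its own symmetric 8-bit scale.
    order = x.abs().amax(dim=0).argsort()
    q = torch.empty_like(x)
    scales = []
    for g in order.chunk(n_groups):
        scale = x[:, g].abs().max().clamp(min=1e-8) / 127.0
        q[:, g] = (x[:, g] / scale).round().clamp(-128, 127) * scale
        scales.append(scale.item())
    return q, scales
```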

Authors:Belinda Z. Li, Been Kim, Zi Wang
Title: QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
Abstract:
Recently, a large amount of work has focused on improving large language models' (LLMs') performance on reasoning benchmarks such as math and logic. However, past work has largely assumed that tasks are well-defined. In the real world, queries to LLMs are often underspecified, only solvable through acquiring missing information. We formalize this as a constraint satisfaction problem (CSP) with missing variable assignments. Using a special case of this formalism where only one necessary variable assignment is missing, we can rigorously evaluate an LLM's ability to identify the minimal necessary question to ask and quantify axes of difficulty levels for each problem. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question, which includes: (1) Logic-Q: Logical reasoning tasks with one missing proposition, (2) Planning-Q: PDDL planning problems with initial states that are partially-observed, (3) GSM-Q: Human-annotated grade school math problems with one missing variable assignment, and (4) GSME-Q: a version of GSM-Q where word problems are translated into equations by human annotators. The LLM is tasked with selecting the correct clarification question(s) from a list of options. While state-of-the-art models excel at GSM-Q and GSME-Q, their accuracy is only 40-50% on Logic-Q and Planning-Q. Analysis demonstrates that the ability to solve well-specified reasoning problems may not be sufficient for success on our benchmark: models have difficulty identifying the right question to ask, even when they can solve the fully specified version of the problem. Furthermore, in the Planning-Q domain, LLMs tend not to hedge, even when explicitly presented with the option to predict "not sure." This highlights the need for deeper investigation into models' information acquisition capabilities.
中文: 最新研究提出QuestBench基准,用于评估大语言模型在信息不全的推理任务中识别最小必要问题的能力,发现即使最先进的模型在逻辑和规划问题上表现不佳,尽管它们在明确定义的任务中表现出色。
English: Recent research introduces QuestBench, a benchmark evaluating LLMs' ability to identify minimal necessary questions in underspecified reasoning tasks, revealing that even state-of-the-art models struggle with logic and planning problems despite excelling in well-defined scenarios.
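The one-missing-variable special case has a compact reading: among the unknown variables, find the single one whose value would make the target derivable from the constraints. The toy below uses forward chaining over (output, inputs) dependencies and is an illustrative reduction, not the benchmark's implementation.

```python
def minimal_question(constraints, known, target):
    # constraints: list of (output_var, input_vars) dependencies.
    unknowns = {v for _, ins in constraints for v in ins} - set(known)
    for candidate in sorted(unknowns):
        have = set(known) | {candidate}
        changed = True
        while changed:  # forward-chain everything derivable from `have`
            changed = False
            for out, ins in constraints:
                if out not in have and all(v in have for v in ins):
                    have.add(out)
                    changed = True
        if target in have:
            return candidate  # asking for this one variable resolves the task
    return None

# Example: area depends on (w, h) and only w is known, so ask for "h".
print(minimal_question([("area", ("w", "h"))], known=["w"], target="area"))
```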

Authors:Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Title: ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
Abstract:
Large Action Models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9x higher throughput compared to existing agentic training frameworks, and our trained models yield top performances across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories. Code: https://github.com/SalesforceAIResearch/xLAM.
中文摘要:ActionStudio是一个轻量级可扩展框架,通过统一智能体轨迹、优化分布式训练流程,将吞吐量提升高达9倍,并开源了工具和包含9.8万条轨迹的数据集,显著提升大动作模型的训练效率。
English Summary: ActionStudio is a lightweight and extensible framework that enhances large action model training by unifying agent trajectories, optimizing distributed workflows, and achieving up to 9x higher throughput while releasing open-source tools and a 98k trajectory dataset.

Authors:Yizhang Zhu, Runzhi Jiang, Boyan Li, Nan Tang, Yuyu Luo
Title: EllieSQL: Cost-Efficient Text-to-SQL with Complexity-Aware Routing
Abstract:
Text-to-SQL automatically translates natural language queries to SQL, allowing non-technical users to retrieve data from databases without specialized SQL knowledge. Despite the success of advanced LLM-based Text-to-SQL approaches on leaderboards, their unsustainable computational costs, though often overlooked, stand as the "elephant in the room" in current leaderboard-driven research, limiting their economic practicability for real-world deployment and widespread adoption. To tackle this, we exploratively propose EllieSQL, a complexity-aware routing framework that assigns queries to suitable SQL generation pipelines based on estimated complexity. We investigate multiple routers to direct simple queries to efficient approaches while reserving computationally intensive methods for complex cases. Drawing from economics, we introduce the Token Elasticity of Performance (TEP) metric, capturing cost-efficiency by quantifying the responsiveness of performance gains relative to token investment in SQL generation. Experiments show that compared to always using the most advanced methods in our study, EllieSQL with the Qwen2.5-0.5B-DPO router reduces token use by over 40% without compromising performance on the Bird development set, achieving more than a 2x boost in TEP over non-routing approaches. This not only advances the pursuit of cost-efficient Text-to-SQL but also invites the community to weigh resource efficiency alongside performance, contributing to progress in sustainable Text-to-SQL. Our source code and model are available at https://elliesql.github.io/.
Chinese: 该研究提出EllieSQL,一种基于复杂度的路由框架,通过将查询分配给合适的SQL生成流程,在不损失性能的情况下将计算成本降低超过40%,并利用新提出的性能代币弹性指标提升了成本效益。
English: The study introduces EllieSQL, a complexity-aware routing framework that optimizes Text-to-SQL systems by directing queries to appropriate SQL generation pipelines, reducing computational costs by over 40% without performance loss and enhancing cost-efficiency through a novel Token Elasticity of Performance metric.
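The abstract defines TEP only informally, so the sketch below applies the textbook notion of elasticity: the relative performance gain divided by the relative increase in token spend when moving from a cheaper pipeline to a costlier one. The function name and formula are assumptions; the paper's exact definition may differ.

```python
def token_elasticity_of_performance(perf_a: float, tokens_a: float,
                                    perf_b: float, tokens_b: float) -> float:
    # Elasticity in the economic sense: %-change in performance per %-change
    # in token investment, comparing pipeline A (cheap) to pipeline B (costly).
    d_perf = (perf_b - perf_a) / perf_a
    d_tokens = (tokens_b - tokens_a) / tokens_a
    return d_perf / d_tokens

# Example: +2% relative accuracy for +80% tokens gives a low elasticity of 0.025.
print(token_elasticity_of_performance(0.60, 1000, 0.612, 1800))
```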

Authors:Yubo Li, Yidi Miao, Xueying Ding, Ramayya Krishnan, Rema Padman
Title: Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stakes domains requires consistent and coherent behavior across multiple rounds of user interaction. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we introduce Position-Weighted Consistency (PWC), a metric designed to capture both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present MT-Consistency, a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by explicitly integrating internal model confidence scores during the generation process. Experimental results demonstrate that CARG significantly improves response stability without sacrificing accuracy, offering a practical path toward more dependable LLM behavior in critical, real-world deployments. Code and data are available at: https://github.com/yubol-bobo/MT-Consistency.
中文摘要:本文提出了一个包含三项关键创新的综合框架——新型一致性评估指标、专业基准数据集和置信度感知生成方法,可在不牺牲准确性的前提下显著提升大语言模型在多轮对话中的稳定性和可靠性。
English Summary: This paper presents a comprehensive framework with three key innovations—a novel consistency metric, a specialized benchmark dataset, and a confidence-aware generation method—to significantly enhance the stability and reliability of Large Language Models in multi-turn interactions without compromising accuracy.
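
The PWC metric weights early-turn stability more heavily than late-turn behavior. A minimal sketch, assuming per-turn binary consistency flags and a simple exponential-decay weighting (the paper's actual weighting scheme may differ):

```python
def position_weighted_consistency(consistent, decay=0.8):
    """consistent: list of 0/1 flags, one per follow-up turn (1 = answer held)."""
    weights = [decay ** t for t in range(len(consistent))]
    return sum(w * c for w, c in zip(weights, consistent)) / sum(weights)

# An early flip (turn 1) costs more than a late one (turn 4).
print(position_weighted_consistency([0, 1, 1, 1]))  # ~0.66
print(position_weighted_consistency([1, 1, 1, 0]))  # ~0.83
```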

Authors:Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra
Title: A Refined Analysis of Massive Activations in LLMs
Abstract:
Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e. suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: https://github.com/bluorion-com/refine_massive_activations.
中文: 本研究分析了多种大型语言模型中的大规模激活现象,挑战了先前假设,指出并非所有此类激活都有害且现有缓解策略具有模型特异性,同时提出了结合目标方差重缩放与注意力KV偏置或动态Tanh的混合方法,能在不影响性能的情况下有效管理激活。
English: This study analyzes massive activations across various large language models, challenging previous assumptions by showing that not all are harmful and that existing mitigation strategies are model-specific, while proposing hybrid approaches like combining Target Variance Rescaling with Attention KV bias or Dynamic Tanh to effectively manage activations without compromising performance.
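
For context, massive activations are usually identified by magnitude criteria. A minimal detection sketch, assuming the common rule of thumb that a massive activation is both large in absolute terms and far above the layer's median magnitude (thresholds are illustrative):

```python
import torch

def find_massive_activations(hidden, abs_thresh=100.0, rel_thresh=1000.0):
    """hidden: (tokens, dim) activations from one layer."""
    mags = hidden.abs()
    mask = (mags > abs_thresh) & (mags > rel_thresh * mags.median())
    return mask.nonzero()  # (token, dim) indices of massive activations

# Toy layer where one hidden dimension carries huge values.
acts = torch.randn(4, 8) * torch.tensor([1.0] * 7 + [10000.0])
print(find_massive_activations(acts))
```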

Authors:Chung-En Sun, Ge Yan, Tsui-Wei Weng
Title: ThinkEdit: Interpretable Weight Editing to Mitigate Overly Short Thinking in Reasoning Models
Abstract:
Recent studies have shown that Large Language Models (LLMs) augmented with chain-of-thought (CoT) reasoning demonstrate impressive problem-solving abilities. However, in this work, we identify a recurring issue where these models occasionally generate overly short reasoning, leading to degraded performance on even simple mathematical problems. Specifically, we investigate how reasoning length is embedded in the hidden representations of reasoning models and its impact on accuracy. Our analysis reveals that reasoning length is governed by a linear direction in the representation space, allowing us to induce overly short reasoning by steering the model along this direction. Building on this insight, we introduce ThinkEdit, a simple yet effective weight-editing approach to mitigate the issue of overly short reasoning. We first identify a small subset of attention heads (approximately 4%) that predominantly drive short reasoning behavior. We then edit the output projection weights of these heads to remove the short reasoning direction. With changes to only 0.2% of the model's parameters, ThinkEdit effectively reduces overly short reasoning and yields notable accuracy gains for short reasoning outputs (+6.39%), along with an overall improvement across multiple math benchmarks (+3.34%). Our findings provide new mechanistic insights into how reasoning length is controlled within LLMs and highlight the potential of fine-grained model interventions to improve reasoning quality. Our code is available at: https://github.com/Trustworthy-ML-Lab/ThinkEdit.
中文: 近期研究发现思维链增强的大语言模型中推理过短会降低性能,为此提出的ThinkEdit方法通过选择性编辑少量注意力头权重,有效减少过短推理并显著提升数学任务准确率。
English: Recent research identifies that overly short reasoning in chain-of-thought augmented LLMs impairs performance, leading to the development of ThinkEdit, a weight-editing method that selectively modifies a small subset of attention heads to effectively reduce this issue and improve accuracy on mathematical tasks.
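
The core edit is a projection: remove a "short reasoning" direction from the output-projection weights of the identified heads. A minimal sketch, assuming the direction is already extracted and the heads already selected (both come from the paper's analysis, not shown here):

```python
import torch

def remove_direction(W, d):
    """W: (d_model, d_head) output-projection slice; d: (d_model,) direction."""
    d = d / d.norm()
    return W - torch.outer(d, d @ W)  # (I - d d^T) W: zero out the direction

W, d = torch.randn(16, 4), torch.randn(16)
W_edited = remove_direction(W, d)
print(((d / d.norm()) @ W_edited).abs().max())  # ~0: direction removed
```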

Authors:Jiancheng Zhao, Xingda Yu, Zhen Yang
Title: MSPLoRA: A Multi-Scale Pyramid Low-Rank Adaptation for Efficient Model Fine-Tuning
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) has become an essential approach for adapting large-scale pre-trained models while reducing computational costs. Among PEFT methods, LoRA significantly reduces trainable parameters by decomposing weight updates into low-rank matrices. However, traditional LoRA applies a fixed rank across all layers, failing to account for the varying complexity of hierarchical information, which leads to inefficient adaptation and redundancy. To address this, we propose MSPLoRA (Multi-Scale Pyramid LoRA), which introduces Global Shared LoRA, Mid-Level Shared LoRA, and Layer-Specific LoRA to capture global patterns, mid-level features, and fine-grained information, respectively. This hierarchical structure reduces inter-layer redundancy while maintaining strong adaptation capability. Experiments on various NLP tasks demonstrate that MSPLoRA achieves more efficient adaptation and better performance while significantly reducing the number of trainable parameters. Furthermore, additional analyses based on Singular Value Decomposition validate its information decoupling ability, highlighting MSPLoRA as a scalable and effective optimization strategy for parameter-efficient fine-tuning in large language models. Our code is available at https://github.com/Oblivioniss/MSPLoRA.
中文:MSPLoRA提出了一种分层参数高效微调方法,通过采用多尺度LoRA模块分别处理全局、中层级和层特定特征,在减少冗余的同时提升适应能力,在多种自然语言处理任务中以更少参数量实现了更优性能。
English: MSPLoRA introduces a hierarchical parameter-efficient fine-tuning method that reduces redundancy and improves adaptation by employing multi-scale LoRA modules for global, mid-level, and layer-specific features, achieving superior performance with fewer parameters across NLP tasks.
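
A minimal sketch of the pyramid idea: a layer's output adds three low-rank deltas at decreasing ranks, with A/B pairs shared globally, shared within a layer group, or owned by the layer. Ranks, sharing scheme, and initialization here are illustrative:

```python
import torch

def pyramid_lora_forward(x, W, loras):
    """x: (batch, d_in); W: frozen base weight (d_out, d_in);
    loras: [(A, B), ...] for global-shared, group-shared, layer-specific,
    with A: (r, d_in) and B: (d_out, r)."""
    out = x @ W.T
    for A, B in loras:
        out = out + x @ A.T @ B.T  # low-rank delta at this scale
    return out

W = torch.randn(32, 32)
# Decreasing ranks: global (8), mid-level (4), layer-specific (2).
loras = [(torch.randn(r, 32) * 0.01, torch.zeros(32, r)) for r in (8, 4, 2)]
print(pyramid_lora_forward(torch.randn(2, 32), W, loras).shape)  # (2, 32)
```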

Authors:Yongce Li, Chung-En Sun, Tsui-Wei Weng
Title: Effective Skill Unlearning through Intervention and Abstention
Abstract:
Large Language Models (LLMs) have demonstrated remarkable skills across various domains. Understanding the mechanisms behind their abilities and implementing controls over them is becoming increasingly important for developing better models. In this paper, we focus on skill unlearning in LLMs, specifically unlearning a particular skill while retaining their overall capabilities. We introduce two lightweight, training-free machine skill unlearning techniques for LLMs. First, we observe that the pre-activation distribution of neurons in each Feed-Forward Layer (FFL) differs when the model demonstrates different skills. Additionally, we find that queries triggering the same skill cluster within the FFL key space and can be separated from other queries using a hypercube. Based on these observations, we propose two lightweight, training-free skill unlearning methods via intervention and abstention respectively: Neuron Adjust and Key Space Detection. We evaluate our methods on unlearning math-solving, Python-coding, and comprehension skills across seven different languages. The results demonstrate their strong unlearning capabilities for the designated skills. Specifically, Key Space Detection achieves over 80% relative performance drop on the forgetting skill and less than 10% relative performance drop on other skills and the model's general knowledge (MMLU) for most unlearning tasks. Our code is available at https://github.com/Trustworthy-ML-Lab/effective_skill_unlearning.
中文: 本文提出了两种轻量级、无需训练的大语言模型技能遗忘方法,能有效消除特定技能,同时保持模型的整体能力和通用知识。
English: This paper introduces two lightweight, training-free methods for skill unlearning in large language models, which effectively eliminate specific skills while preserving overall capabilities and general knowledge.
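
A minimal sketch of the abstention side (Key Space Detection), assuming we fit an axis-aligned hypercube around FFL key-space activations of queries that trigger the target skill, then abstain when a new query lands inside it:

```python
import torch

def fit_hypercube(skill_keys, margin=0.0):
    """skill_keys: (n, d) key-space activations for the skill to unlearn."""
    return skill_keys.min(0).values - margin, skill_keys.max(0).values + margin

def should_abstain(key, lo, hi):
    return bool(((key >= lo) & (key <= hi)).all())

lo, hi = fit_hypercube(torch.randn(100, 8))
print(should_abstain(torch.zeros(8), lo, hi))      # inside the cube -> abstain
print(should_abstain(10 * torch.ones(8), lo, hi))  # far outside -> answer
```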

Authors:Pietro Tropeano, Maria Maistro, Tuukka Ruotsalo, Christina Lioma
Title: As easy as PIE: understanding when pruning causes language models to disagree
Abstract:
Language Model (LM) pruning compresses the model by removing weights, nodes, or other parts of its architecture. Typically, pruning focuses on the resulting efficiency gains at the cost of effectiveness. However, when looking at how individual data points are affected by pruning, it turns out that a particular subset of data points always bears most of the brunt (in terms of reduced accuracy), yet this effect goes unnoticed when reporting the mean accuracy of all data points. These data points are called PIEs and have been studied in image processing, but not in NLP. In a study of various NLP datasets, pruning methods, and levels of compression, we find that PIEs impact inference quality considerably, regardless of class frequency, and that BERT is more prone to this than BiLSTM. We also find that PIEs contain a high amount of data points that have the largest influence on how well the model generalises to unseen data. This means that when pruning, with seemingly moderate loss to accuracy across all data points, we in fact hurt tremendously those data points that matter the most. We trace what makes PIEs both hard and impactful to inference to their overall longer and more semantically complex text. These findings are novel and contribute to understanding how LMs are affected by pruning. The code is available at: https://github.com/pietrotrope/AsEasyAsPIE.
中文: 语言模型剪枝常忽视对特定数据点(称为PIEs)造成的显著准确性下降,这些数据点对模型泛化至关重要,且在BERT等模型中因文本更长、语义更复杂而更易受影响。
English: Language model pruning often overlooks the disproportionate accuracy loss on specific data points called PIEs, which are crucial for generalization and more affected in models like BERT due to their longer and semantically complex texts.
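
Identifying PIEs is simple once both models' predictions are available; following the image-domain definition the paper builds on, a PIE is an example on which the pruned and unpruned models disagree:

```python
def find_pies(preds_dense, preds_pruned):
    """Indices where the pruned model's prediction diverges from the dense one."""
    return [i for i, (a, b) in enumerate(zip(preds_dense, preds_pruned)) if a != b]

print(find_pies([0, 1, 1, 2], [0, 1, 2, 2]))  # -> [2]
```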

Authors:Wenqi Zhang, Mengna Wang, Gangao Liu, Xu Huixin, Yiwei Jiang, Yongliang Shen, Guiyang Hou, Zhe Zheng, Hang Zhang, Xin Li, Weiming Lu, Peng Li, Yueting Zhuang
Title: Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks
Abstract:
Recent advances in deep thinking models have demonstrated remarkable reasoning capabilities on mathematical and coding tasks. However, their effectiveness in embodied domains, which require continuous interaction with environments through image-action interleaved trajectories, remains largely unexplored. We present Embodied-Reasoner, a model that extends o1-style reasoning to interactive embodied search tasks. Unlike mathematical reasoning that relies primarily on logical deduction, embodied scenarios demand spatial understanding, temporal reasoning, and ongoing self-reflection based on interaction history. To address these challenges, we synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes (analysis, spatial reasoning, reflection, planning, and verification). We develop a three-stage training pipeline that progressively enhances the model's capabilities through imitation learning, self-exploration via rejection sampling, and self-correction through reflection tuning. The evaluation shows that our model significantly outperforms advanced visual reasoning models, exceeding OpenAI o1, o3-mini, and Claude-3.7 by +9%, +24%, and +13%, respectively. Analysis reveals that our model exhibits fewer repeated searches and logical inconsistencies, with particular advantages in complex long-horizon tasks, and these advantages carry over to real-world environments.
中文摘要:Embodied Reasoner模型通过合成数千条观察-思考-行动轨迹并采用三阶段训练流程,将深度推理能力扩展到交互式具身任务中,在评估中显著超越了先进的视觉推理模型。
English Summary: The Embodied Reasoner model extends deep reasoning to interactive embodied tasks by synthesizing thousands of Observation-Thought-Action trajectories and employing a three-stage training pipeline, significantly outperforming advanced visual reasoning models in evaluations.

Authors:Yuwei Yin, EunJeong Hwang, Giuseppe Carenini
Title: SWI: Speaking with Intent in Large Language Models
Abstract:
Intent, typically clearly formulated and planned, functions as a cognitive framework for communication and problem-solving. This paper introduces the concept of Speaking with Intent (SWI) in large language models (LLMs), where the explicitly generated intent encapsulates the model's underlying intention and provides high-level planning to guide subsequent analysis and action. By emulating deliberate and purposeful thoughts in the human mind, SWI is hypothesized to enhance the reasoning capabilities and generation quality of LLMs. Extensive experiments on text summarization, multi-task question answering, and mathematical reasoning benchmarks consistently demonstrate the effectiveness and generalizability of Speaking with Intent over direct generation without explicit intent. Further analysis corroborates the generalizability of SWI under different experimental settings. Moreover, human evaluations verify the coherence, effectiveness, and interpretability of the intent produced by SWI. The promising results in enhancing LLMs with explicit intents pave a new avenue for boosting LLMs' generation and reasoning abilities with cognitive notions.
中文: 本文提出“有意图对话”方法,通过显式生成意图来增强大语言模型的推理能力和生成质量,实验和人工评估均证实其有效性。
English: This paper proposes Speaking with Intent (SWI) for large language models, where explicit intent generation enhances reasoning and output quality, as validated by experiments and human evaluation.
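
A minimal sketch of what intent-first generation could look like as a prompt template; the wording below is illustrative, not the paper's actual SWI template:

```python
SWI_TEMPLATE = (
    "Before answering, state your intent: what you aim to accomplish "
    "and the plan you will follow.\n"
    "Then execute the plan and give the final answer.\n\n"
    "Question: {question}\n"
)

print(SWI_TEMPLATE.format(question="Summarize the article in two sentences."))
```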

Authors:Haote Yang, Xingjian Wei, Jiang Wu, Noémi Ligeti-Nagy, Jiaxing Sun, Yinfan Wang, Zijian Győző Yang, Junyuan Gao, Jingchao Wang, Bowen Jiang, Shasha Wang, Nanjun Yu, Zihao Zhang, Shixin Hong, Hongwei Liu, Wei Li, Songyang Zhang, Dahua Lin, Lijun Wu, Gábor Prószéky, Conghui He
Title: OpenHuEval: Evaluating Large Language Model on Hungarian Specifics
Abstract:
We introduce OpenHuEval, the first benchmark for LLMs focusing on the Hungarian language and its specifics. OpenHuEval is constructed from a vast collection of Hungarian-specific materials sourced from multiple origins. In its construction, we incorporated the latest design principles for evaluating LLMs, such as using real user queries from the internet, emphasizing the assessment of LLMs' generative capabilities, and employing LLM-as-judge to enhance the multidimensionality and accuracy of evaluations. Ultimately, OpenHuEval encompasses eight Hungarian-specific dimensions, featuring five tasks and 3953 questions. Consequently, OpenHuEval provides a comprehensive, in-depth, and scientifically accurate assessment of LLM performance in the context of the Hungarian language and its specifics. We evaluated current mainstream LLMs, including both traditional LLMs and recently developed Large Reasoning Models (LRMs). The results demonstrate the significant necessity for evaluation and model optimization tailored to the Hungarian language and its specifics. We also established a framework for analyzing the thinking processes of LRMs with OpenHuEval, revealing intrinsic patterns and mechanisms of these models in non-English languages, with Hungarian serving as a representative example. We will release OpenHuEval at https://github.com/opendatalab/OpenHuEval.
中文: OpenHuEval是首个针对匈牙利语的大语言模型评测基准,通过真实用户查询和多维度评估方法,揭示了针对匈牙利语特性进行模型优化的必要性。
English: OpenHuEval is the first comprehensive benchmark designed to evaluate large language models on Hungarian language proficiency, incorporating real user queries and multidimensional assessments to highlight the need for language-specific model optimization.

Authors:Ryan Marinelli, Josef Pichlmeier, Tamas Bisztray
Title: Harnessing Chain-of-Thought Metadata for Task Routing and Adversarial Prompt Detection
Abstract:
In this work, we propose a metric called Number of Thoughts (NofT) to estimate the difficulty of tasks before prompting and to support Large Language Models (LLMs) in production contexts. By setting thresholds based on the number of thoughts, this metric can discern the difficulty of prompts and support more effective prompt routing. A 2% decrease in latency is achieved when routing prompts from the MathInstruct dataset through quantized, distilled versions of Deepseek with 1.7 billion, 7 billion, and 14 billion parameters. Moreover, this metric can be used to detect adversarial prompts used in prompt injection attacks with high efficacy. The Number of Thoughts can inform a classifier that achieves 95% accuracy in adversarial prompt detection. Our experiments and datasets are available on our GitHub page: https://github.com/rymarinelli/Number_Of_Thoughts/tree/main.
中文: 本文提出“思维数量”指标来评估任务难度并优化大语言模型的提示路由,实现了2%的延迟降低,并在对抗性提示检测中达到95%的准确率。
English: This paper introduces the Number of Thoughts (NofT) metric to assess task difficulty and enhance prompt routing for LLMs, achieving a 2% latency reduction and 95% accuracy in detecting adversarial prompts.
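
A minimal routing sketch, assuming "thoughts" are counted as delimited reasoning steps in a sampled chain of thought and a fixed threshold picks the model size; the counting rule and threshold are illustrative:

```python
def count_thoughts(cot: str) -> int:
    """Count non-empty reasoning steps in a chain-of-thought trace."""
    return sum(1 for line in cot.splitlines() if line.strip())

def route(cot: str, threshold: int = 5) -> str:
    return "14B" if count_thoughts(cot) > threshold else "1.7B"

print(route("Step 1: parse.\nStep 2: compute.\nStep 3: verify."))  # 1.7B
```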

Authors:Junyu Luo, Weizhi Zhang, Ye Yuan, Yusheng Zhao, Junwei Yang, Yiyang Gu, Bohan Wu, Binqi Chen, Ziyue Qiao, Qingqing Long, Rongcheng Tu, Xiao Luo, Wei Ju, Zhiping Xiao, Yifan Wang, Meng Xiao, Chenwu Liu, Jingyang Yuan, Shichang Zhang, Yiqiao Jin, Fan Zhang, Xian Wu, Hanqing Zhao, Dacheng Tao, Philip S. Yu, Ming Zhang
Title: Large Language Model Agent: A Survey on Methodology, Applications and Challenges
Abstract:
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models. Large Language Model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, potentially represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy, linking architectural foundations, collaboration mechanisms, and evolutionary pathways. We unify fragmented research threads by revealing fundamental connections between agent design principles and their emergent behaviors in complex environments. Our work provides a unified architectural perspective, examining how agents are constructed, how they collaborate, and how they evolve over time, while also addressing evaluation methodologies, tool applications, practical challenges, and diverse application domains. By surveying the latest developments in this rapidly evolving field, we offer researchers a structured taxonomy for understanding LLM agents and identify promising directions for future research. The collection is available at https://github.com/luo-junyu/Awesome-Agent-Papers.
中文摘要:本综述通过以方法论为中心的分类体系,系统解构了LLM智能体系统的架构基础、协作机制与演化路径,为理解其发展提供了统一框架并指明了未来研究方向。
English Summary: This survey systematically analyzes LLM agent systems by examining their architecture, collaboration, and evolution, offering a unified framework for understanding their development and identifying future research directions.

Authors:Haoxiang Sun, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen
Title: Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models
Abstract:
In recent years, the rapid development of large reasoning models has resulted in the saturation of existing benchmarks for evaluating mathematical reasoning, highlighting the urgent need for more challenging and rigorous evaluation frameworks. To address this gap, we introduce OlymMATH, a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs. OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions. The problems are systematically organized into two distinct difficulty tiers: (1) AIME-level problems (easy) that establish a baseline for mathematical reasoning assessment, and (2) significantly more challenging problems (hard) designed to push the boundaries of current state-of-the-art models. In our benchmark, these problems span four core mathematical fields, each including a verifiable numerical solution to enable objective, rule-based evaluation. Empirical results underscore the significant challenge presented by OlymMATH, with state-of-the-art models including DeepSeek-R1, OpenAI's o3-mini, and Gemini 2.5 Pro Exp demonstrating notably limited accuracy on the hard subset. Furthermore, the benchmark facilitates comprehensive bilingual assessment of mathematical reasoning abilities, a critical dimension that remains largely unaddressed in mainstream mathematical reasoning benchmarks. We release the benchmark, evaluation code, detailed results, and a data visualization tool at https://github.com/RUCAIBox/OlymMATH.
中文: 针对现有数学推理基准饱和的问题,OlymMATH推出了一个具有挑战性的奥林匹克级别数学基准,包含200道双语题目和两个难度级别,揭示了顶尖模型的显著局限性,并支持全面的双语评估。
English: To address the saturation of existing benchmarks, OlymMATH introduces a challenging Olympiad-level mathematical benchmark with 200 bilingual problems across two difficulty tiers, revealing significant limitations in state-of-the-art models and enabling comprehensive bilingual evaluation.

Authors:Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, Shujian Huang
Title: R-PRM: Reasoning-Driven Process Reward Modeling
Abstract:
Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy, which is further exacerbated by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM). First, we leverage stronger LLMs to generate seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we further enhance performance through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness the model's reasoning potential. Extensive experiments demonstrate R-PRM's effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 11.9 and 8.5 points in F1 scores, respectively. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.5 points across six challenging datasets. Further analysis reveals that R-PRM exhibits more comprehensive evaluation and stronger generalization capabilities, thereby highlighting its significant potential.
中文: R-PRM通过利用有限标注生成种子数据、优化偏好和采用推理时扩展,显著提升了数学推理评估性能,在多个基准测试中远超现有方法。
English: R-PRM enhances mathematical reasoning evaluation by generating seed data from limited annotations, optimizing preferences, and employing inference-time scaling, achieving significant performance gains over existing methods.

Authors:Haoming Xu, Shuxun Wang, Yanqiu Zhao, Yi Zhong, Ziyan Jiang, Ningyuan Zhao, Shumin Deng, Huajun Chen, Ningyu Zhang
Title: ZJUKLAB at SemEval-2025 Task 4: Unlearning via Model Merging
Abstract:
This paper presents the ZJUKLAB team's submission for SemEval-2025 Task 4: Unlearning Sensitive Content from Large Language Models. This task aims to selectively erase sensitive knowledge from large language models, avoiding both over-forgetting and under-forgetting issues. We propose an unlearning system that leverages Model Merging (specifically TIES-Merging), combining two specialized models into a more balanced unlearned model. Our system achieves competitive results, ranking second among 26 teams, with an online score of 0.944 for Task Aggregate and 0.487 for overall Aggregate. In this paper, we also conduct local experiments and perform a comprehensive analysis of the unlearning process, examining performance trajectories, loss dynamics, and weight perspectives, along with several supplementary experiments, to understand the effectiveness of our method. Furthermore, we analyze the shortcomings of our method and evaluation metrics, emphasizing that MIA scores and ROUGE-based metrics alone are insufficient to fully evaluate successful unlearning. Finally, we emphasize the need for more comprehensive evaluation methodologies and rethinking of unlearning objectives in future research. Code is available at https://github.com/zjunlp/unlearn/tree/main/semeval25.
中文: ZJUKLAB团队针对SemEval-2025任务四提出了基于模型融合的遗忘系统,在26支队伍中荣获第二名,其方法在有效消除敏感内容的同时,揭示了当前评估指标的不足,并呼吁建立更全面的评估体系。
English: The ZJUKLAB team introduced a model merging-based unlearning system for SemEval-2025 Task 4, achieving second place by effectively removing sensitive content while highlighting limitations in current evaluation metrics and calling for improved assessment methods.
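
For reference, TIES-Merging combines task vectors (deltas from a shared base) in three steps: trim small entries, elect a majority sign per parameter, and average the agreeing values. A minimal single-tensor sketch of those standard steps (the team's exact configuration may differ):

```python
import torch

def ties_merge(base, deltas, keep=0.5):
    trimmed = []
    for d in deltas:  # 1) trim: keep only the largest-magnitude entries
        k = max(1, int(keep * d.numel()))
        thresh = d.abs().flatten().topk(k).values.min()
        trimmed.append(torch.where(d.abs() >= thresh, d, torch.zeros_like(d)))
    stack = torch.stack(trimmed)
    sign = torch.sign(stack.sum(0))  # 2) elect the majority sign per entry
    agree = (torch.sign(stack) == sign) & (stack != 0)
    merged = (stack * agree).sum(0) / agree.sum(0).clamp(min=1)  # 3) mean
    return base + merged

# Conflicting signs at position 1 cancel; agreeing entries at position 0 merge.
print(ties_merge(torch.zeros(4), [torch.tensor([1.0, -2.0, 0.1, 0.0]),
                                  torch.tensor([1.0, 2.0, 0.0, 0.3])]))
```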

Authors:Ooha Lakkadi Reddy
Title: Rerouting Connection: Hybrid Computer Vision Analysis Reveals Visual Similarity Between Indus and Tibetan-Yi Corridor Writing Systems
Abstract:
This thesis employs a hybrid CNN-Transformer architecture, alongside a detailed anthropological framework, to investigate potential historical connections between the visual morphology of the Indus Valley script and pictographic systems of the Tibetan-Yi Corridor. Through an ensemble methodology of three target scripts across 15 independently trained models, we demonstrate that Tibetan-Yi Corridor scripts exhibit approximately six-fold higher visual similarity to the Indus script (0.635) than to the Bronze Age Proto-Cuneiform (0.102) or Proto-Elamite (0.078). Contrary to expectations, when measured through direct script-to-script embedding comparisons, the Indus script maps closer to Tibetan-Yi Corridor scripts with a mean cosine similarity of 0.930 (CI: [0.917, 0.942]) than to contemporaneous West Asian signaries, which recorded mean similarities of 0.887 (CI: [0.863, 0.911]) and 0.855 (CI: [0.818, 0.891]). Across dimensionality reduction and clustering methods, the Indus script consistently clusters closest to Tibetan-Yi Corridor scripts. These computational findings align with observed pictorial parallels in numeral systems, gender markers, and iconographic elements. Archaeological evidence of contact networks along the ancient Shu-Shendu road, coinciding with the Indus Civilization's decline, provides a plausible transmission pathway. While alternate explanations cannot be ruled out, the specificity and consistency of similarities suggest more complex cultural transmission networks between South and East Asia than previously recognized.
中文: 本研究采用混合CNN-Transformer模型,发现印度河文字与藏彝走廊文字在视觉形态上存在显著相似性,其关联强度远超同期西亚文字,暗示了沿古代通道可能存在文化传播网络。
English: This study uses a hybrid CNN-Transformer model to reveal that the Indus Valley script shares significantly stronger visual and structural similarities with Tibetan-Yi Corridor scripts than with contemporaneous West Asian scripts, suggesting potential historical cultural transmission along ancient pathways.

Authors:Sondos Mahmoud Bsharat, Mukul Ranjan, Aidar Myrzakhan, Jiacheng Liu, Bowei Guo, Shengkun Tang, Zhuang Liu, Yuanzhi Li, Zhiqiang Shen
Title: Mobile-MMLU: A Mobile Intelligence Language Understanding Benchmark
Abstract:
Rapid advancements in large language models (LLMs) have increased interest in deploying them on mobile devices for on-device AI applications. Mobile users interact differently with LLMs compared to desktop users, creating unique expectations and data biases. Current benchmark datasets primarily target server and desktop environments, and there is a notable lack of extensive datasets specifically designed for mobile contexts. Additionally, mobile devices face strict limitations in storage and computing resources, constraining model size and capabilities, thus requiring optimized efficiency and prioritized knowledge. To address these challenges, we introduce Mobile-MMLU, a large-scale benchmark dataset tailored for mobile intelligence. It consists of 16,186 questions across 80 mobile-related fields, designed to evaluate LLM performance in realistic mobile scenarios. A challenging subset, Mobile-MMLU-Pro, provides advanced evaluation similar in size to MMLU-Pro but significantly more difficult than our standard full set. Both benchmarks use multiple-choice, order-invariant questions focused on practical mobile interactions, such as recipe suggestions, travel planning, and essential daily tasks. The dataset emphasizes critical mobile-specific metrics like inference latency, energy consumption, memory usage, and response quality, offering comprehensive insights into model performance under mobile constraints. Moreover, it prioritizes privacy and adaptability, assessing models' ability to perform on-device processing, maintain user privacy, and adapt to personalized usage patterns. The Mobile-MMLU family offers a standardized framework for developing and comparing mobile-optimized LLMs, enabling advancements in productivity and decision-making within mobile computing environments. Our code and data are available at: https://github.com/VILA-Lab/Mobile-MMLU.
中文: Mobile-MMLU基准数据集专为移动智能设计,包含16,186个涵盖80个领域的问题,通过评估模型在资源受限环境下的延迟、能耗等关键指标,为移动端优化的大语言模型开发提供标准化框架。
English: The Mobile-MMLU benchmark dataset is introduced to evaluate large language models in mobile contexts, addressing unique challenges like resource constraints and user interaction biases with 16,186 questions across 80 fields, while prioritizing privacy and on-device performance metrics.

Authors:Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
Title: Understanding R1-Zero-Like Training: A Critical Perspective
Abstract:
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
中文摘要:本研究分析了R1-Zero训练的核心组件,发现GRPO存在优化偏差并提出无偏的Dr. GRPO方法,在保持推理性能的同时提升效率,最终通过极简配方使7B模型在AIME 2024上达到43.3%的最新最优准确率。
English Summary: This study analyzes R1-Zero training components and identifies optimization bias in GRPO, introducing Dr. GRPO to improve efficiency while achieving state-of-the-art 43.3% accuracy on AIME 2024 with a minimalist 7B model recipe.
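
A minimal sketch contrasting the two advantage computations, assuming the bias the paper identifies comes from GRPO's reward-std scaling and per-response length normalization; Dr. GRPO drops both terms:

```python
import torch

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])  # one sampled group of responses
adv_grpo = (rewards - rewards.mean()) / rewards.std()  # GRPO: std-scaled
adv_dr_grpo = rewards - rewards.mean()                 # Dr. GRPO: unscaled

# GRPO also averages token losses per response (1/|o_i|), which under-penalizes
# long incorrect outputs; Dr. GRPO divides by a constant instead, removing the
# incentive toward longer wrong answers.
print(adv_grpo, adv_dr_grpo)
```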

Authors:Chenxi Wang, Jizhan Fang, Xiang Chen, Bozhong Tian, Ziwen Xu, Huajun Chen, Ningyu Zhang
Title: ADS-Edit: A Multimodal Knowledge Editing Dataset for Autonomous Driving Systems
Abstract:
Recent advancements in Large Multimodal Models (LMMs) have shown promise in Autonomous Driving Systems (ADS). However, their direct application to ADS is hindered by challenges such as misunderstanding of traffic knowledge, complex road conditions, and diverse vehicle states. To address these challenges, we propose the use of Knowledge Editing, which enables targeted modifications to a model's behavior without the need for full retraining. Meanwhile, we introduce ADS-Edit, a multimodal knowledge editing dataset specifically designed for ADS, which includes various real-world scenarios, multiple data types, and comprehensive evaluation metrics. We conduct comprehensive experiments and derive several interesting conclusions. We hope that our work will contribute to the further advancement of knowledge editing applications in the field of autonomous driving. Code and data are available at https://github.com/zjunlp/EasyEdit/blob/main/examples/ADSEdit.md.
中文摘要:本研究提出知识编辑方法和ADS-Edit数据集,通过针对性修正模型行为来解决自动驾驶中交通知识误解与复杂路况问题,无需完整重新训练即可提升多模态大模型性能。
English Summary: This study introduces Knowledge Editing and the ADS-Edit dataset to enhance Large Multimodal Models for autonomous driving by addressing traffic knowledge gaps and complex road scenarios without full retraining.

Authors:Han Wang, Yongjie Ye, Bingru Li, Yuxiang Nie, Jinghui Lu, Jingqun Tang, Yanjie Wang, Can Huang
Title: Vision as LoRA
Abstract:
We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible context, VoRA can process inputs at arbitrary resolutions. To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We successfully demonstrate that with additional pre-training data, VoRA can perform comparably with conventional encoder-based MLLMs. All training data, code, and model weights will be released at https://github.com/Hon-Wong/VoRA.
Chinese: VoRA是一种创新方法,通过将视觉专用LoRA层直接集成到大型语言模型中,使其转变为多模态模型,实现了推理过程中参数的无缝融合和任意分辨率输入的处理能力。
English: VoRA is a novel approach that transforms a large language model into a multimodal model by integrating vision-specific LoRA layers directly into the LLM, enabling seamless parameter merging and flexible input resolution processing during inference.
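
The inference-time merging VoRA relies on is standard LoRA algebra: the low-rank delta folds into the base weight, leaving a plain dense layer. A minimal sketch with illustrative shapes:

```python
import torch

def merge_lora(W, A, B, alpha=1.0):
    """W: (d_out, d_in); A: (r, d_in); B: (d_out, r)."""
    return W + alpha * (B @ A)  # after merging, inference uses a plain Linear

W, A, B = torch.randn(32, 32), torch.randn(4, 32), torch.randn(32, 4)
x = torch.randn(2, 32)
assert torch.allclose(x @ merge_lora(W, A, B).T,
                      x @ W.T + x @ A.T @ B.T, atol=1e-4)
```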

Authors:Han Wu, Yuxuan Yao, Shuqi Liu, Zehua Liu, Xiaojin Fu, Xiongwei Han, Xing Li, Hui-Ling Zhen, Tao Zhong, Mingxuan Yuan
Title: Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging
Abstract:
The transition from System 1 to System 2 reasoning in large language models (LLMs) has marked significant advancements in handling complex tasks through deliberate, iterative thinking. However, this progress often comes at the cost of efficiency, as models tend to overthink, generating redundant reasoning steps without proportional improvements in output quality. Long-to-Short (L2S) reasoning has emerged as a promising solution to this challenge, aiming to balance reasoning depth with practical efficiency. While existing approaches, such as supervised fine-tuning (SFT), reinforcement learning (RL), and prompt engineering, have shown potential, they are either computationally expensive or unstable. Model merging, on the other hand, offers a cost-effective and robust alternative by integrating the quick-thinking capabilities of System 1 models with the methodical reasoning of System 2 models. In this work, we present a comprehensive empirical study on model merging for L2S reasoning, exploring diverse methodologies, including task-vector-based, SVD-based, and activation-informed merging. Our experiments reveal that model merging can reduce average response length by up to 55% while preserving or even improving baseline performance. We also identify a strong correlation between model scale and merging efficacy with extensive evaluations on 1.5B/7B/14B/32B models. Furthermore, we investigate the merged model's ability to self-critique and self-correct, as well as its adaptive response length based on task complexity. Our findings highlight model merging as a highly efficient and effective paradigm for L2S reasoning, offering a practical solution to the overthinking problem while maintaining the robustness of System 2 reasoning. This work can be found on Github https://github.com/hahahawu/Long-to-Short-via-Model-Merging.
中文: 模型融合通过结合系统一的快速反应与系统二的深度推理,有效减少大语言模型的过度思考问题,能在保持或提升性能的同时将平均响应长度缩短高达55%。
English: Model merging offers an efficient solution to reduce overthinking in large language models by combining System 1's speed with System 2's depth, achieving up to 55% shorter responses while maintaining or improving performance.
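
Of the merging families the study compares, task-vector merging is the simplest to sketch: interpolate the deltas of the quick-thinking and deliberate models from a shared base. The weights below are illustrative:

```python
import torch

def merge_l2s(base, system1, system2, w1=0.5, w2=0.5):
    """All arguments are state dicts with identical keys."""
    return {k: base[k] + w1 * (system1[k] - base[k]) + w2 * (system2[k] - base[k])
            for k in base}

base = {"w": torch.tensor([1.0, 1.0])}
s1 = {"w": torch.tensor([2.0, 1.0])}   # quick-thinking (System 1) model
s2 = {"w": torch.tensor([1.0, 3.0])}   # deliberate (System 2) model
print(merge_l2s(base, s1, s2))          # {'w': tensor([1.5, 2.0])}
```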

Authors:Yijiong Yu
Title: Accelerate Parallelizable Reasoning via Parallel Decoding within One Sequence
Abstract:
Recent advances in reasoning models have demonstrated significant improvements in accuracy by employing detailed and comprehensive reasoning processes. However, generating these lengthy reasoning sequences is computationally expensive and time-consuming. To address this inefficiency, we leverage the inherent parallelizability of certain tasks to accelerate the reasoning process. Specifically, when multiple parallel reasoning steps exist, we decode multiple tokens per forward pass via a tree-like attention mask within a single sequence, avoiding additional memory usage. Experimental results show that our method achieves up to nearly 100% speedup in decoding while largely maintaining answer quality.
Chinese: 近期推理模型虽提升准确性但效率低下,我们通过树状注意力掩码并行化推理步骤,在不影响质量的情况下实现近100%的加速。
English: Recent advances in reasoning models improve accuracy but are inefficient, so we accelerate them by parallelizing steps with a tree-like attention mask, achieving nearly 100% speedup without compromising quality.
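
A minimal sketch of the mask idea for two parallel branches decoded in one sequence: each branch attends to the shared prefix and to itself, but not to its sibling. The layout is illustrative:

```python
import torch

def tree_mask(prefix_len, branch_lens):
    total = prefix_len + sum(branch_lens)
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :prefix_len] = True  # every position sees the shared prefix
    start = prefix_len
    for ln in branch_lens:       # each branch sees only itself
        mask[start:start + ln, start:start + ln] = True
        start += ln
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    return mask & causal

print(tree_mask(2, [2, 2]).int())  # rows 4-5 attend to prefix + own branch only
```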

Authors:Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Title: VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
Abstract:
Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.
中文: VPO是一个基于无害性、准确性和有益性三原则的提示优化框架,能忠实保留用户意图并显著提升生成视频的安全性和质量。
English: VPO is a principled framework that optimizes text prompts for video generation by ensuring harmlessness, accuracy, and helpfulness, significantly improving safety, alignment, and video quality across models.

Authors:Haoqin Tu, Weitao Feng, Hardy Chen, Hui Liu, Xianfeng Tang, Cihang Xie
Title: ViLBench: A Suite for Vision-Language Process Reward Modeling
Abstract:
Process-supervised reward models serve as a fine-grained function that provides detailed step-wise feedback to model responses, facilitating effective selection of reasoning trajectories for complex tasks. Despite their advantages, the evaluation of PRMs remains underexplored, especially in the multimodal domain. To address this gap, this paper first benchmarks current vision large language models (VLLMs) as two types of reward models: output reward models (ORMs) and process reward models (PRMs) on multiple vision-language benchmarks, revealing that neither ORMs nor PRMs consistently outperform across all tasks, and that superior VLLMs do not necessarily yield better rewarding performance. To further advance evaluation, we introduce ViLBench, a vision-language benchmark designed to require intensive process reward signals. Notably, OpenAI's GPT-4o with Chain-of-Thought (CoT) achieves only 27.3% accuracy, indicating the benchmark's challenge for current VLLMs. Lastly, we preliminarily showcase a promising pathway towards bridging the gap between general VLLMs and reward models: by collecting 73.6K vision-language process reward data using an enhanced tree-search algorithm, our 3B model achieves an average improvement of 3.3% over standard CoT and up to 2.5% compared to its untrained counterpart on ViLBench by selecting OpenAI o1's generations. We release the implementations at https://ucsc-vlaa.github.io/ViLBench with our code, model, and data.
Chinese: 本文对视觉语言模型作为奖励模型进行了基准测试,发现其在不同任务中表现不一,并引入ViLBench评估过程奖励,表明即使先进模型也面临挑战,同时证明利用过程数据进行针对性训练可有效提升性能。
English: This paper benchmarks vision-language models as reward models, revealing inconsistent performance across tasks, and introduces ViLBench to evaluate process rewards, showing that even advanced models struggle while demonstrating that targeted training with process data can enhance performance.

Authors:Zhouhong Gu, Xingzhou Chen, Xiaoran Shi, Tao Wang, Suhang Zheng, Tianyu Li, Hongwei Feng, Yanghua Xiao
Title: GAPO: Learning Preferential Prompt through Generative Adversarial Policy Optimization
Abstract:
Recent advances in large language models have highlighted the critical need for precise control over model outputs through predefined constraints. While existing methods attempt to achieve this through either direct instruction-response synthesis or preferential response optimization, they often struggle with constraint understanding and adaptation. This limitation becomes particularly evident when handling fine-grained constraints, leading to either hallucination or brittle performance. We introduce Generative Adversarial Policy Optimization (GAPO), a novel framework that combines GAN-based training dynamics with an encoder-only reward model to progressively learn and adapt to increasingly complex constraints. GAPO leverages adversarial training to automatically generate training samples of varying difficulty while utilizing the encoder-only architecture to better capture prompt-response relationships. Extensive experiments demonstrate GAPO's superior performance across multiple benchmarks, particularly in scenarios requiring fine-grained constraint handling, where it significantly outperforms existing methods like PPO, DPO, and KTO. Our results suggest that GAPO's unique approach to preferential prompt learning offers a more robust and effective solution for controlling LLM outputs. Code is available at https://github.com/MikeGu721/GAPO.
Chinese: 摘要介绍了生成对抗策略优化(GAPO),这是一种通过对抗性训练和仅编码器奖励模型来增强大语言模型处理细粒度约束能力的新框架,在多个基准测试中显著优于PPO和DPO等现有方法。
English: The abstract introduces Generative Adversarial Policy Optimization (GAPO), a novel framework that enhances large language models' ability to handle fine-grained constraints through adversarial training and an encoder-only reward model, outperforming existing methods like PPO and DPO in benchmarks.

Authors:Huanhuan Ma, Haisong Gong, Xiaoyuan Yi, Xing Xie, Dongkuan Xu
Title: Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs
Abstract:
Recent advancements in Large Language Models (LLMs) have led to their increasing integration into human life. With the transition from mere tools to human-like assistants, understanding their psychological aspects, such as emotional tendencies and personalities, becomes essential for ensuring their trustworthiness. However, current psychological evaluations of LLMs, often based on human psychological assessments like the BFI, face significant limitations. The results from these approaches often lack reliability and have limited validity when predicting LLM behavior in real-world scenarios. In this work, we introduce a novel evaluation instrument specifically designed for LLMs, called the Core Sentiment Inventory (CSI). CSI is a bilingual tool, covering both English and Chinese, that implicitly evaluates models' sentiment tendencies, providing an insightful psychological portrait of LLMs across three dimensions: optimism, pessimism, and neutrality. Through extensive experiments, we demonstrate that: 1) CSI effectively captures nuanced emotional patterns, revealing significant variation in LLMs across languages and contexts; 2) compared to current approaches, CSI significantly improves reliability, yielding more consistent results; and 3) the correlation between CSI scores and the sentiment of LLMs' real-world outputs exceeds 0.85, demonstrating its strong validity in predicting LLM behavior. We make CSI publicly available at: https://github.com/dependentsign/CSI.
中文摘要:本研究开发了核心情感量表(CSI),这是一种专为大型语言模型设计的双语心理评估工具,能够通过乐观、悲观和中立三个维度可靠评估模型情感倾向,相比现有方法显著提升了评估可靠性,并与实际输出情感呈现超过0.85的相关性。
English Summary: This study introduces the Core Sentiment Inventory (CSI), a bilingual psychological evaluation tool designed to reliably assess Large Language Models' emotional tendencies across optimism, pessimism, and neutrality dimensions, demonstrating superior reliability and over 0.85 correlation with real-world outputs compared to existing methods.

Authors:Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen
Title: LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
Abstract:
We introduce LogQuant, a groundbreaking 2-bit quantization technique for the KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as math and code completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks such as Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.
中文: LogQuant是一种创新的2位KV缓存量化技术,可在保持高性能的同时大幅降低大语言模型推理的内存占用,在同等压缩率下将复杂任务的准确率提升40%至200%,并提高吞吐量25%。
English: LogQuant is an innovative 2-bit KV Cache quantization method that significantly reduces memory usage in LLM inference while maintaining high performance, achieving up to 25% throughput improvement and 40-200% accuracy gains on complex tasks.
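
A minimal sketch of log-distributed position selection, assuming the filter keeps a dense window of recent tokens in full precision and exponentially sparser positions further back, with everything else quantized to 2 bits; the spacing rule is illustrative:

```python
def log_positions(seq_len, recent=16):
    """Token positions kept in full precision; the rest get 2-bit KV."""
    keep = set(range(max(0, seq_len - recent), seq_len))  # dense recent window
    pos, step = seq_len - recent, 1
    while pos > 0:  # exponentially sparser selection going backwards
        pos -= step
        keep.add(max(pos, 0))
        step *= 2
    return keep

print(sorted(log_positions(64)))
```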

Authors:Ilias Stogiannidis, Steven McDonagh, Sotirios A. Tsaftaris
Title: Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models
Abstract:
Vision-Language Models (VLMs) have recently emerged as powerful tools, excelling in tasks that integrate visual and textual comprehension, such as image captioning, visual question answering, and image-text retrieval. However, existing benchmarks for VLMs include spatial components, which often fail to isolate spatial reasoning from related tasks such as object detection or semantic comprehension. In this paper, we address these deficiencies with a multi-faceted approach towards understanding spatial reasoning. Informed by the diverse and multi-dimensional nature of human spatial reasoning abilities, we present a detailed analysis that first delineates the core elements of spatial reasoning: spatial relations, orientation and navigation, mental rotation, and spatial visualization, and then assesses the performance of these models in both synthetic and real-world images, bridging controlled and naturalistic contexts. We analyze 13 state-of-the-art Vision-Language Models, uncovering pivotal insights into their spatial reasoning performance. Our results reveal profound shortcomings in current VLMs, with average accuracy across the 13 models approximating random chance, highlighting spatial reasoning as a persistent obstacle. This work not only exposes the pressing need to advance spatial reasoning within VLMs but also establishes a solid platform for future exploration. Code available on GitHub (https://github.com/stogiannidis/srbench) and dataset available on HuggingFace (https://huggingface.co/datasets/stogiannidis/srbench).
Chinese: 视觉语言模型在空间推理方面存在显著不足,其表现接近随机猜测水平,凸显了推进该领域发展的迫切需求。
English: Vision-Language Models show significant limitations in spatial reasoning, with their performance averaging near random chance, highlighting an urgent need for advancements in this area.

Authors:Yuxuan Hu, Xiaodong Chen, Cuiping Li, Hong Chen, Jing Zhang
Title: QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition
Abstract:
Large Language Models (LLMs) excel in diverse applications but suffer inefficiency due to massive scale. While quantization reduces computational costs, existing methods degrade accuracy in medium-sized LLMs (e.g., Llama-3-8B) due to activation outliers. To address this, we propose QUAD (Quantization with Activation Decomposition), a framework leveraging Singular Value Decomposition (SVD) to suppress activation outliers for effective 4-bit quantization. QUAD estimates activation singular vectors offline using calibration data to construct an orthogonal transformation matrix P, shifting outliers to additional dimensions kept in full precision while quantizing the remaining components to 4 bits. Additionally, QUAD enables parameter-efficient fine-tuning via adaptable full-precision outlier weights, narrowing the accuracy gap between quantized and full-precision models. Experiments demonstrate that QUAD achieves 94% to 96% accuracy under W4A4 quantization and 98% accuracy with W4A4/A8 and parameter-efficient fine-tuning for Llama-3 and Qwen-2.5 models. Our code is available at https://github.com/hyx1999/Quad.
中文摘要:QUAD框架利用奇异值分解抑制激活异常值,通过参数高效微调实现中型大语言模型的高效4位量化,同时保持高精度。
English Summary: The QUAD framework uses Singular Value Decomposition to suppress activation outliers, enabling efficient 4-bit quantization of medium-sized LLMs while maintaining high accuracy through parameter-efficient fine-tuning.
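
A minimal sketch of the rotate-then-split idea, assuming P comes from an SVD of calibration activations so outlier energy concentrates in a few leading dimensions, which stay in full precision while the rest are quantized; the 4-bit quantizer and dimension split are illustrative:

```python
import torch

X = torch.randn(256, 64)  # calibration activations
_, _, Vh = torch.linalg.svd(X, full_matrices=False)
P = Vh.T                  # orthogonal transform from activation singular vectors

def quad_split(x, P, n_keep=4):
    z = x @ P                                    # rotate activations
    keep, rest = z[..., :n_keep], z[..., n_keep:]
    scale = rest.abs().max() / 7                 # symmetric 4-bit range [-7, 7]
    rest_q = torch.clamp((rest / scale).round(), -7, 7) * scale
    return torch.cat([keep, rest_q], dim=-1) @ P.T  # rotate back

print((quad_split(X[:2], P) - X[:2]).abs().max())   # small quantization error
```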

Authors:Chi-Chih Chang, Chien-Yu Lin, Yash Akhauri, Wei-Cheng Lin, Kai-Chiang Wu, Luis Ceze, Mohamed S. Abdelfattah
Title: xKV: Cross-Layer SVD for KV-Cache Compression
Abstract:
Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption to store the Key and Value states (KV-Cache). Recent studies attempted to merge KV-cache from multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on assumptions of high per-token cosine similarity across layers which generally does not hold in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a simple post-training method that applies Singular Value Decomposition (SVD) on the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely-used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to 6.8x higher compression rates than state-of-the-art inter-layer technique while improving accuracy by 2.7%. Moreover, xKV is compatible with the emerging Multi-Head Latent Attention (MLA) (e.g., DeepSeek-Coder-V2), yielding a notable 3x compression rates on coding tasks without performance degradation. These results highlight xKV's strong capability and versatility in addressing memory bottlenecks for long-context LLM inference. Our code is publicly available at: https://github.com/abdelfattah-lab/xKV.
Chinese: xKV是一种后训练方法,通过奇异值分解将多层KV缓存压缩至共享低秩子空间,在长上下文基准测试中实现6.8倍压缩率提升,同时准确率提高2.7%。
English: xKV is a post-training method that uses Singular Value Decomposition to compress the KV-Cache across multiple layers, achieving up to 6.8x higher compression than existing techniques while improving accuracy by 2.7% on long-context benchmarks.
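
The cross-layer alignment of dominant singular vectors is what makes a shared factorization work. A minimal sketch, assuming the caches of a layer group are concatenated and factored with one truncated SVD (rank and grouping are illustrative):

```python
import torch

def xkv_compress(kv_group, rank=8):
    """kv_group: list of (tokens, d) caches from grouped layers."""
    stacked = torch.cat(kv_group, dim=1)      # (tokens, n_layers * d)
    U, S, Vh = torch.linalg.svd(stacked, full_matrices=False)
    coeffs = U[:, :rank] * S[:rank]           # shared low-rank code per token
    return coeffs, Vh[:rank]                  # reconstruct: coeffs @ basis

kv = [torch.randn(32, 16) for _ in range(4)]
coeffs, basis = xkv_compress(kv)
print(coeffs.shape, basis.shape)  # (32, 8), (8, 64): 8 ranks replace 64 dims
```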

Authors:Zhexuan Wang, Yutong Wang, Xuebo Liu, Liang Ding, Miao Zhang, Jie Liu, Min Zhang
Title: AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration
Abstract:
Multi-agent systems (MAS) based on large language models (LLMs) have demonstrated significant potential in collaborative problem-solving. However, they still face substantial challenges of low communication efficiency and suboptimal task performance, making the careful design of the agents' communication topologies particularly important. Inspired by the management theory that roles in an efficient team are often dynamically adjusted, we propose AgentDropout, which identifies redundant agents and communication across different communication rounds by optimizing the adjacency matrices of the communication graphs and eliminates them to enhance both token efficiency and task performance. Compared to state-of-the-art methods, AgentDropout achieves an average reduction of 21.6% in prompt token consumption and 18.4% in completion token consumption, along with a performance improvement of 1.14 on the tasks. Furthermore, the extended experiments demonstrate that AgentDropout achieves notable domain transferability and structure robustness, revealing its reliability and effectiveness. We release our code at https://github.com/wangzx1219/AgentDropout.
中文:AgentDropout通过动态剔除冗余智能体与通信链路,有效提升多智能体系统的令牌效率与任务表现,并展现出优异的领域迁移性和结构鲁棒性。
English: AgentDropout enhances multi-agent systems by dynamically eliminating redundant agents and communication, achieving significant reductions in token usage and improved task performance.
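
A minimal sketch of the elimination step, assuming a learned weighted adjacency matrix per communication round whose weakest edges are pruned; how those weights are optimized is the paper's contribution and is not shown:

```python
import torch

def drop_edges(adj, keep_ratio=0.5):
    """adj: (n_agents, n_agents) learned edge weights for one round."""
    flat = adj.flatten()
    k = max(1, int(keep_ratio * flat.numel()))
    thresh = flat.topk(k).values.min()
    return (adj >= thresh).float()  # binary topology; isolated agents drop out

print(drop_edges(torch.rand(4, 4)))
```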

Authors:Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
Title: I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders
Abstract:
Recent LLMs like DeepSeek-R1 have demonstrated state-of-the-art performance by integrating deep thinking and complex reasoning during generation. However, the internal mechanisms behind these reasoning processes remain unexplored. We observe reasoning LLMs consistently use vocabulary associated with human reasoning processes. We hypothesize these words correspond to specific reasoning moments within the models' internal mechanisms. To test this hypothesis, we employ Sparse Autoencoders (SAEs), a technique for sparse decomposition of neural network activations into human-interpretable features. We introduce ReasonScore, an automatic metric to identify active SAE features during these reasoning moments. We perform manual and automatic interpretation of the features detected by our metric, and find those with activation patterns matching uncertainty, exploratory thinking, and reflection. Through steering experiments, we demonstrate that amplifying these features increases performance on reasoning-intensive benchmarks (+2.2%) while producing longer reasoning traces (+20.5%). Using the model diffing technique, we provide evidence that these features are present only in models with reasoning capabilities. Our work provides the first step towards a mechanistic understanding of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning
中文: 本研究通过稀疏自编码器引入ReasonScore识别推理大模型中的可解释特征,揭示了与不确定性和反思相关的内部机制,增强这些特征可提升推理性能。
English: This study introduces ReasonScore to identify interpretable features in reasoning LLMs using Sparse Autoencoders, revealing mechanisms for uncertainty and reflection that enhance reasoning performance when amplified.
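
The metric can be pictured as asking which SAE features fire preferentially on reasoning-marker tokens ("wait", "perhaps", ...). A toy version of such a score, where the exact formula is my assumption rather than the paper's definition:

```python
import torch

def reason_score(sae_acts: torch.Tensor, is_reasoning_tok: torch.Tensor):
    """sae_acts: (n_tokens, n_features) non-negative SAE activations.
    is_reasoning_tok: (n_tokens,) bool mask for reasoning-marker tokens."""
    on = sae_acts[is_reasoning_tok].mean(dim=0)    # mean act on reasoning tokens
    off = sae_acts[~is_reasoning_tok].mean(dim=0)  # mean act elsewhere
    return (on - off) / (on + off + 1e-8)          # in [-1, 1]; high = reasoning-specific

acts = torch.rand(1000, 64)    # toy SAE activations
mask = torch.rand(1000) < 0.1  # toy reasoning-token mask
top = reason_score(acts, mask).topk(5)
print(top.indices, top.values)  # candidate "reasoning" features
```

Features that rank highest would then be the candidates for the steering experiments the abstract describes.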

Authors:Dayou Du, Shijie Cao, Jianyi Cheng, Luo Mai, Ting Cao, Mao Yang
Title: BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
Abstract:
The rise of long-context Large Language Models (LLMs) amplifies memory and bandwidth demands during autoregressive decoding, as the Key-Value (KV) cache grows with each generated token. Low-bit KV-cache quantization (e.g., 4-bit or 2-bit) can reduce memory footprint while preserving accuracy, but existing systems suffer from slow decoding due to their exclusive reliance on CUDA cores, neglecting Tensor Cores (the primary source of compute on modern GPUs). We present BitDecoding, a new long-context LLM inference system with a low-bit KV cache. BitDecoding enables efficient low-bit KV-cache decoding by cooperatively leveraging CUDA cores and Tensor Cores. It introduces methods for automatically inducing optimized layouts to exploit Tensor Cores, along with warp-level parallelization strategies for dequantization. For unified system support, BitDecoding includes a query transformation module supporting diverse attention variants, a quantization kernel that supports both tensor-wise and channel-wise scaling used in various quantization algorithms with high performance, and a dequantization kernel with a software-defined pipeline to coordinate CUDA and Tensor Cores execution for mixed-precision operations. Evaluated on RTX 4090, A100, and H100, BitDecoding accelerates decoding by up to 7.5x, 4.8x, and 8.9x, respectively, over FP16 FlashDecoding-v2, and surpasses the state-of-the-art low-bit system QServe by up to 4.3x. On LLaMA-3.1-8B with a 128K context, BitDecoding reduces single-batch decoding latency by 3x, showing substantial improvements for long-context generation. The code is available at https://github.com/DD-DuDa/BitDecoding.
中文: BitDecoding是一种创新的长上下文LLM推理系统,通过协同利用CUDA和Tensor核心优化低比特KV缓存解码,在保持精度的同时大幅提升了现有方法的解码速度。
English: BitDecoding is a novel long-context LLM inference system that optimizes low-bit KV-cache decoding by leveraging both CUDA and Tensor Cores, achieving significant speed improvements over existing methods while maintaining accuracy.
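
For intuition on the memory side, here is a toy 4-bit KV-cache round trip with tensor-wise scaling. BitDecoding's actual contribution is the kernel side (Tensor Core layouts, warp-level dequantization), which plain PyTorch cannot reproduce; this sketch only shows the quantization format:

```python
import torch

def quant4(x: torch.Tensor):
    scale = x.abs().amax() / 7.0  # symmetric int4 range [-8, 7], tensor-wise scale
    q = torch.clamp(torch.round(x / scale), -8, 7).to(torch.int16) + 8  # 0..15
    packed = (q[..., ::2] | (q[..., 1::2] << 4)).to(torch.uint8)  # two nibbles/byte
    return packed, scale

def dequant4(packed: torch.Tensor, scale: torch.Tensor):
    lo = (packed & 0x0F).to(torch.int16) - 8   # even positions
    hi = ((packed >> 4) & 0x0F).to(torch.int16) - 8  # odd positions
    return torch.stack((lo, hi), dim=-1).flatten(-2) * scale

k = torch.randn(2, 128, 64)            # (heads, seq, head_dim) toy K cache
packed, s = quant4(k)
print(packed.numel() / k.numel())      # 0.5 bytes/element vs 2 for fp16
print((dequant4(packed, s) - k).abs().max())  # bounded by scale / 2
```

Packing two 4-bit values per byte is what yields the 4x footprint reduction over fp16 that makes long contexts fit; the system's speedups come from dequantizing this layout efficiently on Tensor Cores.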

Authors:Danrui Li, Yichao Shi, Yaluo Wang, Ziying Shi, Mubbasir Kapadia
Title: ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models
Abstract:
Efficiently searching for relevant case studies is critical in architectural design, as designers rely on precedent examples to guide or inspire their ongoing projects. However, traditional text-based search tools struggle to capture the inherently visual and complex nature of architectural knowledge, often leading to time-consuming and imprecise exploration. This paper introduces ArchSeek, an innovative case study search system with recommendation capability, tailored for architecture design professionals. Powered by the visual understanding capabilities from vision-language models and cross-modal embeddings, it enables text and image queries with fine-grained control, and interaction-based design case recommendations. It offers architects a more efficient, personalized way to discover design inspirations, with potential applications across other visually driven design fields. The source code is available at https://github.com/danruili/ArchSeek.
中文摘要:ArchSeek是一种创新的建筑案例搜索系统,通过视觉语言模型实现精确的图文查询和个性化设计推荐,有效解决了传统文本搜索工具在捕捉建筑知识视觉复杂性方面的不足。
English Summary: ArchSeek is an innovative visual search system for architects that uses vision-language models to enable precise text and image queries, offering personalized design case recommendations to overcome the limitations of traditional text-based tools.
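
A generic cross-modal retrieval sketch of the kind the system builds on: embed case-study images and a text query in a shared space and rank by cosine similarity. The CLIP checkpoint is a stand-in, and ArchSeek's fine-grained query control and recommendation logic are not reproduced:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_cases(query: str, image_paths: list[str]) -> list[tuple[str, float]]:
    images = [Image.open(p) for p in image_paths]
    inputs = proc(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # (1, d) text embedding against (n, d) image embeddings, broadcast to (n,)
    sims = torch.cosine_similarity(out.text_embeds, out.image_embeds)
    order = sims.argsort(descending=True)
    return [(image_paths[int(i)], sims[i].item()) for i in order]

# Usage (paths are placeholders):
# print(rank_cases("curved timber facade with daylight atrium", ["a.jpg", "b.jpg"]))
```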

Authors:Yihan Wang, Peiyu Liu, Xin Yang
Title: LinkAlign: Scalable Schema Linking for Real-World Large-Scale Multi-Database Text-to-SQL
Abstract:
Schema linking is a critical bottleneck in applying existing Text-to-SQL models to real-world, large-scale, multi-database environments. Through error analysis, we identify two major challenges in schema linking: (1) Database Retrieval: accurately selecting the target database from a large schema pool, while effectively filtering out irrelevant ones; and (2) Schema Item Grounding: precisely identifying the relevant tables and columns within complex and often redundant schemas for SQL generation. Based on these challenges, we introduce LinkAlign, a novel framework tailored for large-scale databases with thousands of fields. LinkAlign comprises three key steps: multi-round semantically enhanced retrieval and irrelevant information isolation for Challenge 1, and schema extraction enhancement for Challenge 2. Each stage supports both Agent and Pipeline execution modes, enabling a balance between efficiency and performance via modular design. To enable more realistic evaluation, we construct AmbiDB, a synthetic dataset designed to reflect the ambiguity of real-world schema linking. Experiments on widely-used Text-to-SQL benchmarks demonstrate that LinkAlign consistently outperforms existing baselines on all schema linking metrics. Notably, it improves the overall Text-to-SQL pipeline and achieves a new state-of-the-art score of 33.09% on the Spider 2.0-Lite benchmark using only open-source LLMs, ranking first on the leaderboard at the time of submission. The code is available at https://github.com/Satissss/LinkAlign
中文: 模式链接是文本到SQL模型在大规模数据库中的关键挑战,LinkAlign框架通过改进数据库检索和模式项定位来解决这一问题,在基准测试中达到了最优性能。
English: Schema linking is a key challenge in Text-to-SQL models for large-scale databases, addressed by the LinkAlign framework which enhances database retrieval and schema item grounding, achieving state-of-the-art performance on benchmarks.
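
The two challenges map naturally onto a two-stage retrieval sketch, shown below under simplifying assumptions (a generic sentence encoder, no multi-round enhancement, no Agent mode); the encoder name and thresholds are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

enc = SentenceTransformer("all-MiniLM-L6-v2")

def link_schema(question: str, databases: dict[str, list[str]], top_items: int = 3):
    """databases: db_name -> flat list of 'table.column' schema items."""
    # Stage 1: database retrieval against a textual summary of each schema.
    db_names = list(databases)
    db_docs = [f"{name}: " + ", ".join(items) for name, items in databases.items()]
    q = enc.encode(question, convert_to_tensor=True)
    db_scores = util.cos_sim(q, enc.encode(db_docs, convert_to_tensor=True))
    best_db = db_names[int(db_scores.argmax())]
    # Stage 2: schema-item grounding within the selected database.
    items = databases[best_db]
    scores = util.cos_sim(q, enc.encode(items, convert_to_tensor=True))[0]
    keep = scores.argsort(descending=True)[:top_items]
    return best_db, [items[int(i)] for i in keep]

dbs = {"shop": ["orders.id", "orders.total", "users.name"],
       "clinic": ["patients.name", "visits.date", "visits.diagnosis"]}
print(link_schema("Which patients visited in March?", dbs))
```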

Authors:Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang
Title: Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models
Abstract:
Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to over-focus on certain irrelevant image tokens that do not contain critical information for answering the question and distort the output. To address this, we propose an Instruction-Aligned Visual Attention (IAVA) approach, which identifies irrelevant tokens by comparing changes in attention weights under two different instructions. By applying contrastive decoding, we dynamically adjust the logits generated from original image tokens and irrelevant image tokens, reducing the model's over-attention to irrelevant information. The experimental results demonstrate that IAVA consistently outperforms existing decoding techniques on benchmarks such as MME, POPE, and TextVQA in mitigating object hallucinations. Our IAVA approach is available online at https://github.com/Lee-lab558/IAVA.
中文摘要:提出的指令对齐视觉注意力(IAVA)方法通过对比解码动态调整对无关图像标记的关注,有效缓解大型视觉语言模型中的物体幻觉问题,在多项基准测试中均展现出优越性能。
English Summary: The proposed Instruction-Aligned Visual Attention (IAVA) method mitigates object hallucinations in Large Vision-Language Models by dynamically adjusting attention to irrelevant image tokens through contrastive decoding, demonstrating superior performance across multiple benchmarks.
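
The contrastive-decoding step itself fits in a few lines. How the irrelevant tokens are identified (attention shifts between two instructions) is the paper's contribution and is assumed away here; the combination formula and `alpha` are illustrative:

```python
import torch

def contrastive_logits(logits_full: torch.Tensor,
                       logits_irrelevant: torch.Tensor,
                       alpha: float = 1.0) -> torch.Tensor:
    """logits_full: next-token logits with all image tokens visible.
    logits_irrelevant: logits when only the flagged irrelevant tokens remain.
    Down-weights what the model would say from irrelevant evidence alone."""
    return (1 + alpha) * logits_full - alpha * logits_irrelevant

vocab = 32000
full, irr = torch.randn(vocab), torch.randn(vocab)  # toy logit vectors
next_id = contrastive_logits(full, irr).argmax()
```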

Authors:Junyuan Gao, Jiahe Song, Jiang Wu, Runchuan Zhu, Guanlin Shen, Shasha Wang, Xingjian Wei, Haote Yang, Songyang Zhang, Weijia Li, Bin Wang, Dahua Lin, Lijun Wu, Conghui He
Title: PM4Bench: A Parallel Multilingual Multi-Modal Multi-task Benchmark for Large Vision Language Model
Abstract:
Existing multilingual benchmarks for Large Vision Language Models (LVLMs) suffer from limitations including language-specific content biases, disjointed multimodal input formats, and a lack of safety evaluation. To address these gaps, we propose PM4Bench, the first Parallel Multilingual Multi-Modal Multi-task Benchmark for LVLMs. PM4Bench features a parallel corpus design across 10 languages, enabling fair and accurate cross-lingual comparisons. It includes the vision setting where text and queries are embedded in images, requiring LVLMs to simultaneously "see", "read", and "think", aligning with real-world applications. Additionally, PM4Bench incorporates safety evaluations, addressing a critical oversight in existing multilingual benchmarks. Using PM4Bench, we evaluate 11 mainstream LVLMs, revealing significant cross-linguistic performance disparities, particularly in vision settings, and identifying OCR capability as a key determinant of these imbalances. We will release PM4Bench at https://github.com/opendatalab/PM4Bench.
中文: PM4Bench作为首个并行多语言多模态基准,通过包含10种语言的平行语料库、融合视觉与文本的任务及安全性评估,解决了现有大型视觉语言模型评测的不足,并揭示了与OCR能力相关的性能差异。
English: PM4Bench is introduced as the first parallel multilingual multi-modal benchmark addressing limitations in existing LVLM evaluations by featuring a 10-language parallel corpus, integrated vision-text tasks, and safety assessments, revealing performance disparities tied to OCR capabilities.

Authors:Wei Deng, Mengshi Qi, Huadong Ma
Title: Global-Local Tree Search in VLMs for 3D Indoor Scene Generation
Abstract:
Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper considers this task as a planning problem subject to spatial and layout common sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places each object sequentially and explores multiple placements during each placement process, where the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically, i.e., room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm then searches this tree of the problem space. To leverage the VLM to produce positions of objects, we discretize the top-down view space as a dense grid and fill each cell with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid and the VLM produces a reasonable location for the object by naming the emoji at that position. The quantitative and qualitative experimental results illustrate that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen.
中文: 本文提出一种新颖的全局-局部树搜索算法,通过将空间规划分解为层次化结构并利用表情符号网格提示,借助视觉语言模型生成合理的3D室内场景。
English: This paper introduces a novel global-local tree search algorithm that leverages Vision-Language Models to generate plausible 3D indoor scenes by decomposing spatial planning into hierarchical levels and using emoji-grid prompts for object placement.
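
A sketch of the emoji-grid prompting trick the abstract describes; the emoji set and prompt wording are illustrative:

```python
def emoji_grid_prompt(obj: str, rows: int = 4, cols: int = 4) -> tuple[str, dict]:
    emojis = ["😀", "🐶", "🍎", "🚗", "🌲", "⚽", "🎩", "🔔",
              "🐟", "🍩", "📚", "🎸", "🌙", "🧊", "🪁", "🦉"]
    cell_of = {}   # emoji -> (row, col), used to decode the VLM's answer
    lines = []
    for r in range(rows):
        row = []
        for c in range(cols):
            e = emojis[r * cols + c]
            cell_of[e] = (r, c)
            row.append(e)
        lines.append(" ".join(row))
    grid = "\n".join(lines)
    prompt = (f"The room is shown top-down as a {rows}x{cols} grid of emojis:\n"
              f"{grid}\nName the single emoji in the cell where the {obj} "
              f"should be placed, respecting walls and existing furniture.")
    return prompt, cell_of

prompt, cell_of = emoji_grid_prompt("sofa")
# A VLM reply like "🍎" maps back to a grid cell via cell_of["🍎"].
```

Because every cell carries a visually distinct label, the VLM can answer with a single token-like symbol instead of fragile coordinates.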

Authors:Junteng Liu, Weihao Zeng, Xiwen Zhang, Yijun Wang, Zifei Shan, Junxian He
Title: On the Perception Bottleneck of VLMs for Chart Understanding
Abstract:
Chart understanding requires models to effectively analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck, where the visual representation may fail to encapsulate the correct information, and the extraction bottleneck, where the language model struggles to extract the necessary information from the provided visual representations. Through comprehensive experiments, we find that (1) the information embedded within visual representations is substantially richer than what is typically captured by linear extractors, such as the widely used retrieval accuracy metric; (2) While instruction tuning effectively enhances the extraction capability of LVLMs, the vision encoder remains a critical bottleneck, demanding focused attention and improvement. Therefore, we further enhance the visual encoder to mitigate the vision encoder bottleneck under a contrastive learning framework. Empirical results demonstrate that our approach significantly mitigates the perception bottleneck and improves the ability of LVLMs to comprehend charts. Code is publicly available at https://github.com/hkust-nlp/Vision4Chart.
Chinese: 本研究揭示了大型视觉语言模型在图表理解中的感知瓶颈,归因于视觉编码器和信息提取的双重限制,并通过对比学习增强视觉编码器的方法,显著提升了模型性能。
English: This study identifies the perception bottleneck in large vision-language models for chart understanding, attributing it to limitations in both the vision encoder and information extraction, and proposes a contrastive learning-enhanced visual encoder that significantly improves model performance.

Authors:Cheng Huang, Fan Gao, Yutong Liu, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Hao Wang, Yongbin Yu
Title: TIB-STC: A Large-Scale Structured Tibetan Benchmark for Low-Resource Language Modeling
Abstract:
The advancement of large language models (LLMs) has brought transformative capabilities to NLP, but such progress remains unevenly distributed, especially for low-resource and culturally rich languages like Tibetan. In this paper, we present TIB-STC, the first large-scale, expert-curated, and multi-domain dataset specifically designed to support the development and evaluation of LLMs for the Tibetan language. Spanning over 11 billion tokens across literature, religion, medicine, law, and daily communication, TIB-STC preserves traditional grammar and stylistic richness. To validate its utility, we train a reference model, Sun-Shine, on TIB-STC through a three-stage pipeline involving pretraining, supervised fine-tuning, and preference optimization. Evaluation on the TLUE Benchmark for Tibetan-specific tasks, including Ti-MMLU and Ti-SafetyBench, demonstrates TIB-STC's effectiveness in enabling robust instruction-following and culturally aligned generation. We release TIB-STC to advance research in low-resource language modeling and promote inclusivity in multilingual NLP. All data are available: https://github.com/Vicentvankor/sun-shine.
中文:TIB-STC数据集作为首个大规模、专家标注的多领域藏语资源,通过Sun-Shine模型验证了其在藏语特定任务中实现文化对齐生成与精准指令跟随的有效性,推动低资源语言建模发展。
English: The TIB-STC dataset, comprising over 11 billion tokens across diverse domains, is introduced to advance Tibetan language modeling, with the Sun-Shine model demonstrating its effectiveness in culturally aligned generation and robust instruction-following on Tibetan-specific benchmarks.

Authors:Massimo Bini, Leander Girrbach, Zeynep Akata
Title: DeLoRA: Decoupling Angles and Strength in Low-rank Adaptation
Abstract:
Parameter-Efficient Fine-Tuning (PEFT) methods have recently gained significant popularity thanks to the widespread availability of large-scale pretrained models. These methods allow for quick adaptation to downstream tasks with minimal computational cost. However, popular finetuning methods such as LoRA exhibit limited robustness when it comes to hyperparameter choices or extended training regimes, preventing optimal out-of-the-box performance. In contrast, bounded approaches, such as ETHER, provide greater robustness but are limited to extremely low-rank adaptations and fixed-strength transformations, reducing their adaptation expressive power. In this work, we propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance. Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness. Code is available at https://github.com/ExplainableML/DeLoRA.
中文: DeLoRA是一种新颖的参数高效微调方法,通过将角度学习与适应强度解耦来增强鲁棒性,在多项任务中实现优于或持平现有方法的性能,同时保持更高的稳定性。
English: DeLoRA is a novel parameter-efficient fine-tuning method that enhances robustness by decoupling angular learning from adaptation strength, achieving superior or comparable performance across multiple tasks while maintaining greater stability than existing approaches.
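
A hedged sketch of the decoupling idea: the low-rank matrices supply only a direction, and a separate learnable scalar sets the adaptation strength. Normalizing the full update by its Frobenius norm is my simplification; the paper's exact per-component normalization may differ:

```python
import torch
import torch.nn as nn

class DecoupledLowRank(nn.Module):
    def __init__(self, d_in: int, d_out: int, rank: int = 8, init_scale: float = 0.1):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, d_in) / d_in**0.5)
        self.B = nn.Parameter(torch.randn(d_out, rank) / rank**0.5)
        self.scale = nn.Parameter(torch.tensor(init_scale))  # adaptation strength

    def delta_w(self) -> torch.Tensor:
        ba = self.B @ self.A
        # Bounded-norm update: direction learned by A, B; magnitude by scale.
        return self.scale * ba / (ba.norm() + 1e-8)

    def forward(self, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        return x @ (w + self.delta_w()).T

layer = DecoupledLowRank(64, 64)
x, w = torch.randn(2, 64), torch.randn(64, 64)  # toy input and frozen weight
print(layer(x, w).shape)  # torch.Size([2, 64])
```

Because the update's norm is pinned to `scale`, runaway growth under long training or aggressive learning rates is bounded by construction, which is the robustness the abstract claims.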

Authors:Suman Adhya, Avishek Lahiri, Debarshi Kumar Sanyal, Partha Pratim Das
Title: Evaluating Negative Sampling Approaches for Neural Topic Models
Abstract:
Negative sampling has emerged as an effective technique that enables deep learning models to learn better representations by introducing the paradigm of learn-to-compare. The approach adds robustness by training models to contrast positive samples against negative ones. Despite its numerous demonstrations in various areas of computer vision and natural language processing, a comprehensive study of the effect of negative sampling in an unsupervised domain like topic modeling has not been well explored. In this paper, we present a comprehensive analysis of the impact of different negative sampling strategies on neural topic models. We compare the performance of several popular neural topic models by incorporating a negative sampling technique in the decoder of variational autoencoder-based neural topic models. Experiments on four publicly available datasets demonstrate that integrating negative sampling into topic models results in significant enhancements across multiple aspects, including improved topic coherence, richer topic diversity, and more accurate document classification. Manual evaluations also indicate that the inclusion of negative sampling into neural topic models enhances the quality of the generated topics. These findings highlight the potential of negative sampling as a valuable tool for advancing the effectiveness of neural topic models.
中文摘要:负采样通过对比正负样本提升神经主题模型的性能,实验证明其能显著增强主题连贯性、多样性及分类准确性,推动模型效果进步。
English Summary: Negative sampling enhances neural topic models by comparing positive and negative samples, significantly improving topic coherence, diversity, and classification accuracy as demonstrated in comprehensive experiments.

Authors:Varvara Krechetova, Denis Kochedykov
Title: GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks
Abstract:
In this paper, we establish a benchmark for evaluating large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess seven leading commercial LLMs (Sonnet 3.5 and 3.7, Haiku 3.5, Gemini 2.0, GPT-4o, GPT-4o mini, and o3-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks across four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test hallucination rejection. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference implementations. Results show Sonnet 3.5 and GPT-4o achieve the best overall performance, with Claude models excelling on solvable tasks while OpenAI models better identify unsolvable scenarios. We observe significant differences in token usage, with Anthropic models consuming substantially more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources, providing one more standardized method for ongoing evaluation of LLMs for GeoAI.
中文摘要:本文针对商业地理信息系统实践中的多步骤空间任务建立了大语言模型评估基准,结果显示Sonnet 3.5和GPT-4o综合表现最佳,同时发现不同模型在令牌使用效率和常见错误模式上存在显著差异。
English Summary: This paper establishes a benchmark for evaluating large language models on multi-step geospatial tasks, finding Sonnet 3.5 and GPT-4o deliver the best overall performance while revealing significant differences in token efficiency and common error patterns across models.

Authors:Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
Title: Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Abstract:
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in a simulator-free and labor-saving manner to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
中文摘要:本文提出了一种重写驱动的增强范式(RAM),通过改写人类标注的训练数据直接生成未见过的观察-指令对,以无模拟器和省人工的方式解决了视觉语言导航领域的数据稀缺问题,在多个数据集上展现出卓越的泛化性能。
English Summary: This paper introduces a Rewriting-driven Augmentation (RAM) paradigm that addresses data scarcity in Vision-Language Navigation by generating unseen observation-instruction pairs through rewriting mechanisms, achieving superior generalization across multiple datasets without simulators or manual data cleaning.

Authors:Yiming Zhao, Yu Zeng, Yukun Qi, YaoYang Liu, Lin Chen, Zehui Chen, Xikun Bao, Jie Zhao, Feng Zhao
Title: V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
Abstract:
Large Vision-Language Models (LVLMs) have made significant progress in the field of video understanding recently. However, current benchmarks uniformly lean on text prompts for evaluation, which often necessitate complex referential language and fail to provide precise spatial and temporal references. This limitation diminishes the experience and efficiency of human-model interaction. To address this limitation, we propose the Video Visual Prompt Benchmark (V2P-Bench), a comprehensive benchmark specifically designed to evaluate LVLMs' video understanding capabilities in multimodal human-model interaction scenarios. V2P-Bench includes 980 unique videos and 1,172 QA pairs, covering 5 main tasks and 12 dimensions, facilitating instance-level fine-grained understanding aligned with human cognition. Benchmarking results reveal that even the most powerful models perform poorly on V2P-Bench (65.4% for GPT-4o and 67.9% for Gemini-1.5-Pro), significantly lower than the human experts' 88.3%, highlighting the current shortcomings of LVLMs in understanding video visual prompts. We hope V2P-Bench will serve as a foundation for advancing multimodal human-model interaction and video understanding evaluation. Project page: https://github.com/gaotiexinqu/V2P-Bench.
中文摘要:V2P-Bench基准通过多模态提示评估大视觉语言模型的视频理解能力,弥补了纯文本评估的不足,结果显示现有模型与人类专家表现存在显著差距。
English Summary: The V2P-Bench is introduced to address the limitations of text-only evaluations by assessing Large Vision-Language Models' video understanding through multimodal prompts, revealing significant performance gaps compared to human experts.

Authors:Suet-Ying Lam, Qingcheng Zeng, Jingyi Wu, Rob Voigt
Title: Leveraging Human Production-Interpretation Asymmetries to Test LLM Cognitive Plausibility
Abstract:
Whether large language models (LLMs) process language similarly to humans has been the subject of much theoretical and practical debate. We examine this question through the lens of the production-interpretation distinction found in human sentence processing and evaluate the extent to which instruction-tuned LLMs replicate this distinction. Using an empirically documented asymmetry between pronoun production and interpretation in humans for implicit causality verbs as a testbed, we find that some LLMs do quantitatively and qualitatively reflect human-like asymmetries between production and interpretation. We demonstrate that whether this behavior holds depends upon both model size, with larger models more likely to reflect human-like patterns, and the choice of meta-linguistic prompts used to elicit the behavior. Our code and results are available at https://github.com/LingMechLab/Production-Interpretation_Asymmetries_ACL2025.
中文:研究发现,某些大型语言模型在处理隐含因果动词的代词时,表现出与人类相似的语言生成与理解不对称性,这种特性受模型规模和提示方式影响。
English: Some large language models exhibit human-like asymmetries between language production and interpretation, influenced by model size and specific prompts, as demonstrated through pronoun processing with implicit causality verbs.

Authors:Louis Owen, Abhay Kumar, Nilabhra Roy Chowdhury, Fabian Güra
Title: Variance Control via Weight Rescaling in LLM Pre-training
Abstract:
The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and management of its growth during LLM pre-training, specifically, is somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme, and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
中文: 层索引重缩放(LIR)和目标方差重缩放(TVR)技术通过优化权重初始化和方差控制,显著提升大语言模型预训练效果,改善下游任务表现并缓解激活异常问题。
English: The Layer Index Rescaling (LIR) and Target Variance Rescaling (TVR) techniques improve LLM pre-training by optimizing weight initialization and variance control, leading to enhanced downstream task performance and reduced activation issues.
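
The two techniques can be pictured as an init-time and a train-time control on weight variance. The scaling law (inverse square root of layer index) and target values below are assumptions for illustration, not the paper's constants:

```python
import torch

def lir_init(weight: torch.Tensor, layer_idx: int, base_std: float = 0.02):
    """Layer-Index Rescaling (sketch): deeper layers get a smaller init std.
    layer_idx is 1-based; the 1/sqrt(layer) law is an assumed form."""
    torch.nn.init.normal_(weight, mean=0.0, std=base_std / (layer_idx ** 0.5))

def tvr_step(weight: torch.Tensor, target_std: float = 0.02):
    """Target Variance Rescaling (sketch): periodically pull the empirical
    std of a weight matrix back to a target value during training."""
    with torch.no_grad():
        std = weight.std()
        if std > 0:
            weight.mul_(target_std / std)

w = torch.empty(512, 512)
lir_init(w, layer_idx=12)
w += 0.05 * torch.randn_like(w)  # stand-in for drift over many optimizer steps
tvr_step(w)
print(w.std())                   # ~0.02 again
```

Keeping weight variance near a target also caps the extreme activations the abstract mentions, which is why the technique helps quantization and low-precision training.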

Authors:Shu Pu, Yaochen Wang, Dongping Chen, Yuhang Chen, Guohao Wang, Qi Qin, Zhongyi Zhang, Zhiyuan Zhang, Zetong Zhou, Shuang Gong, Yi Gui, Yao Wan, Philip S. Yu
Title: Judge Anything: MLLM as a Judge Across Any Modality
Abstract:
Evaluating generative foundation models on open-ended multimodal understanding (MMU) and generation (MMG) tasks across diverse modalities (e.g., images, audio, video) poses significant challenges due to the complexity of cross-modal interactions. To this end, the idea of utilizing Multimodal LLMs (MLLMs) as automated judges has emerged, with encouraging results in assessing vision-language understanding tasks. Moving further, this paper extends MLLM-as-a-Judge across modalities in a unified manner by introducing two benchmarks, TaskAnything and JudgeAnything, to respectively evaluate the overall performance and judging capabilities of MLLMs across any-to-any modality tasks. Specifically, TaskAnything evaluates the MMU and MMG capabilities across 15 any-to-any modality categories, employing 1,500 queries curated from well-established benchmarks. Furthermore, JudgeAnything evaluates the judging capabilities of 5 advanced MLLMs (e.g., GPT-4o and Gemini-2.0-Flash) from the perspectives of Pair Comparison and Score Evaluation, providing a standardized testbed that incorporates human judgments and detailed rubrics. Our extensive experiments reveal that while these MLLMs show promise in assessing MMU (i.e., achieving an average of 66.55% in Pair Comparison setting and 42.79% in Score Evaluation setting), they encounter significant challenges with MMG tasks (i.e., averaging only 53.37% in Pair Comparison setting and 30.05% in Score Evaluation setting), exposing cross-modality biases and hallucination issues. To address this, we present OmniArena, an automated platform for evaluating omni-models and multimodal reward models. Our work highlights the need for fairer evaluation protocols and stronger alignment with human preferences. The source code and dataset are publicly available at: https://urrealhero.github.io/judgeanythingweb/.
中文: 本文提出TaskAnything和JudgeAnything基准,用于评估多模态大语言模型在任意模态任务中的表现与评判能力,发现其在多模态理解方面表现良好但在生成任务中存在显著挑战,揭示了跨模态偏见问题。
English: This paper introduces TaskAnything and JudgeAnything benchmarks to evaluate multimodal LLMs' performance and judging capabilities across any-to-any modality tasks, revealing their strengths in multimodal understanding but significant challenges in generation tasks due to cross-modality biases.

Authors:Jiaheng Liu, Dawei Zhu, Zhiqi Bai, Yancheng He, Huanxuan Liao, Haoran Que, Zekun Wang, Chenchen Zhang, Ge Zhang, Jiebin Zhang, Yuanxing Zhang, Zhuo Chen, Hangyu Guo, Shilong Li, Ziqiang Liu, Yong Shan, Yifan Song, Jiayi Tian, Wenhao Wu, Zhejian Zhou, Ruijie Zhu, Junlan Feng, Yang Gao, Shizhu He, Zhoujun Li, Tianyu Liu, Fanyu Meng, Wenbo Su, Yingshui Tan, Zili Wang, Jian Yang, Wei Ye, Bo Zheng, Wangchunshu Zhou, Wenhao Huang, Sujian Li, Zhaoxiang Zhang
Title: A Comprehensive Survey on Long Context Language Modeling
Abstract:
Efficient processing of long contexts has been a persistent pursuit in Natural Language Processing. With the growing number of long documents, dialogues, and other textual data, it is important to develop Long Context Language Models (LCLMs) that can process and analyze extensive inputs in an effective and efficient way. In this paper, we present a comprehensive survey on recent advances in long-context modeling for large language models. Our survey is structured around three key aspects: how to obtain effective and efficient LCLMs, how to train and deploy LCLMs efficiently, and how to evaluate and analyze LCLMs comprehensively. For the first aspect, we discuss data strategies, architectural designs, and workflow approaches oriented with long context processing. For the second aspect, we provide a detailed examination of the infrastructure required for LCLM training and inference. For the third aspect, we present evaluation paradigms for long-context comprehension and long-form generation, as well as behavioral analysis and mechanism interpretability of LCLMs. Beyond these three key aspects, we thoroughly explore the diverse application scenarios where existing LCLMs have been deployed and outline promising future development directions. This survey provides an up-to-date review of the literature on long-context LLMs, which we wish to serve as a valuable resource for both researchers and engineers. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/LCLM-Horizon/A-Comprehensive-Survey-For-Long-Context-Language-Modeling.
中文: 本文对长上下文语言模型的最新进展进行了全面综述,涵盖其开发、训练、部署和评估,并探讨了应用场景与未来发展方向。
English: This paper presents a comprehensive survey on recent advances in long-context language models, covering their development, training, deployment, and evaluation while exploring applications and future directions.

Authors:Yansi Li, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Qiuzhi Liu, Rui Wang, Zhuosheng Zhang, Zhaopeng Tu, Haitao Mi, Dong Yu
Title: Dancing with Critiques: Enhancing LLM Reasoning with Stepwise Natural Language Self-Critique
Abstract:
Enhancing the reasoning capabilities of large language models (LLMs), particularly for complex tasks requiring multi-step logical deductions, remains a significant challenge. Traditional inference time scaling methods utilize scalar reward signals from process reward models to evaluate candidate reasoning steps, but these scalar rewards lack the nuanced qualitative information essential for understanding and justifying each step. In this paper, we propose a novel inference-time scaling approach -- stepwise natural language self-critique (PANEL), which employs self-generated natural language critiques as feedback to guide the step-level search process. By generating rich, human-readable critiques for each candidate reasoning step, PANEL retains essential qualitative information, facilitating better-informed decision-making during inference. This approach bypasses the need for task-specific verifiers and the associated training overhead, making it broadly applicable across diverse tasks. Experimental results on challenging reasoning benchmarks, including AIME and GPQA, demonstrate that PANEL significantly enhances reasoning performance, outperforming traditional scalar reward-based methods. Our code is available at https://github.com/puddingyeah/PANEL to support and encourage future research in this promising field.
中文摘要:本文提出PANEL方法,通过自我生成的自然语言评语替代传统标量奖励来指导推理步骤,无需特定任务验证器即可显著提升大语言模型在复杂推理任务中的表现。
English Summary: This paper introduces PANEL, a novel inference-time scaling method that uses self-generated natural language critiques instead of scalar rewards to guide reasoning steps, significantly improving LLMs' performance on complex reasoning tasks without requiring task-specific verifiers.
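
One search step of such a scheme might look as follows, with `llm` standing in for any text-completion callable; the critique-to-score heuristic is deliberately crude, and PANEL's actual selection procedure may differ:

```python
def panel_step(llm, problem: str, steps_so_far: list[str], n_candidates: int = 4) -> str:
    """Sample candidate next steps, critique each in natural language with the
    same model, and keep the candidate whose critique reads most favorably."""
    context = problem + "\n" + "\n".join(steps_so_far)
    candidates = [llm(f"{context}\nPropose the next reasoning step:")
                  for _ in range(n_candidates)]
    best, best_score = None, float("-inf")
    for cand in candidates:
        critique = llm(f"{context}\nCandidate step: {cand}\n"
                       "Critique this step: is it sound and does it make progress?")
        # Crude keyword scalarization of the free-text critique, for ranking only.
        text = critique.lower()
        score = (sum(w in text for w in ("sound", "valid", "progress"))
                 - sum(w in text for w in ("incorrect", "error", "flaw")))
        if score > best_score:
            best, best_score = cand, score
    return best
```

The point of the method is that the critique text itself carries the qualitative signal; collapsing it to a scalar, as done here for ranking, is the lossy step that scalar reward models are stuck with from the start.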

Authors:Yihe Deng, Hritik Bansal, Fan Yin, Nanyun Peng, Wei Wang, Kai-Wei Chang
Title: OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
Abstract:
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning, achieving notable performance gains on challenging visual reasoning tasks. While text-based reasoning models (e.g., Deepseek R1) show promising results in text-only tasks, distilling their reasoning into LVLMs via supervised fine-tuning (SFT) often results in performance degradation due to imprecise visual grounding. Conversely, purely reinforcement learning (RL)-based methods face a large search space, hindering the emergence of reflective behaviors in smaller models (e.g., 7B LVLMs). Surprisingly, alternating between SFT and RL ultimately results in significant performance improvements after a few iterations. Our analysis reveals that the base model rarely exhibits reasoning behaviors initially, but SFT effectively surfaces these latent actions and narrows the RL search space, accelerating the development of reasoning capabilities. Each subsequent RL stage further refines the model's reasoning skills, producing higher-quality SFT data for continued self-improvement. OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning, notably improving MathVista by 3.8%, EMMA by 2.4%, and HallusionBench by 1.6%. Beyond demonstrating the synergy between SFT and RL for complex reasoning tasks, our findings provide early evidence towards achieving R1-style reasoning in multimodal contexts. The code, model and data are held at https://github.com/yihedeng9/OpenVLThinker.
中文摘要:OpenVLThinker是首批展现复杂链式推理能力的开源视觉语言大模型之一,通过交替使用监督微调与强化学习,在多项视觉推理基准测试中取得显著性能提升。
English Summary: OpenVLThinker is an open-source vision-language model that achieves superior visual reasoning through alternating supervised fine-tuning and reinforcement learning, significantly advancing performance across multiple benchmarks.

Authors:Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito
Title: KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
Abstract:
We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks like OCR post-processing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. These tokenizers help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
中文: KL3M分词器专为法律、金融和政府文本设计,对专业术语可减少高达83%的标记使用量,并提供字符级版本用于文本纠错,有效提升专业领域处理效率并降低计算需求。
English: The KL3M tokenizers are specialized for legal, financial, and governmental text, offering up to 83% fewer tokens for domain terms and character-level versions for text correction tasks, enhancing efficiency and computational savings in professional applications.
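
Comparing token counts is a one-liner with Hugging Face tokenizers. Note the KL3M repo id below is an assumption based on the tokenizer name in the abstract; check the authors' release for the exact identifier:

```python
from transformers import AutoTokenizer

kl3m = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")  # assumed repo id
gpt_like = AutoTokenizer.from_pretrained("gpt2")

text = "The indemnitor shall hold harmless the indemnitee from all liabilities."
print("kl3m tokens:", len(kl3m.encode(text)))
print("gpt2 tokens:", len(gpt_like.encode(text)))
```

Fewer tokens per legal term means more of a long contract fits in the same context window, which is the practical payoff the abstract emphasizes.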

Authors:Anshumann, Mohd Abbas Zaidi, Akhil Kedia, Jinwoo Ahn, Taehwak Kwon, Kangwook Lee, Haejun Lee, Joohyung Lee
Title: Sparse Logit Sampling: Accelerating Knowledge Distillation in LLMs
Abstract:
Knowledge distillation can be a cost-effective technique to distill knowledge in Large Language Models, if the teacher output logits can be pre-computed and cached. However, successfully applying this to pre-training remains largely unexplored. In this work, we prove that naive approaches for sparse knowledge distillation such as caching Top-K probabilities, while intuitive, provide biased estimates of teacher probability distribution to the student, resulting in suboptimal performance and calibration. We propose an importance-sampling-based method `Random Sampling Knowledge Distillation', which provides unbiased estimates, preserves the gradient in expectation, and requires storing significantly sparser logits. Our method enables faster training of student models with marginal overhead (<10%) compared to cross-entropy based training, while maintaining competitive performance compared to full distillation, across a range of model sizes from 300M to 3B.
Chinese: 知识蒸馏在预计算教师模型输出时可高效传递大语言模型知识,但缓存Top-K概率等朴素方法会给学生模型带来有偏估计,导致性能不佳;我们提出的随机采样知识蒸馏方法提供无偏估计,且仅需存储更稀疏的logits,在3亿到30亿参数规模的模型上实现更快训练并保持与完整蒸馏相当的性能。
English: Knowledge distillation can efficiently transfer knowledge from large language models if teacher logits are pre-computed, but naive methods like caching Top-K probabilities introduce bias, leading to suboptimal results; our proposed Random Sampling Knowledge Distillation offers unbiased estimates with sparse logits, enabling faster training and competitive performance across model sizes.
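
The key property is that sampling token ids from the teacher's softmax and averaging -log q_student over those samples is an unbiased estimator of the full cross-entropy H(p_teacher, q_student), unlike truncated Top-K caching. A toy version, with the cache layout and sample count as illustrative choices:

```python
import torch

def cache_teacher_samples(teacher_logits: torch.Tensor, k: int = 4):
    """teacher_logits: (seq, vocab). Stores only k sampled ids per position
    instead of the full distribution."""
    probs = teacher_logits.softmax(dim=-1)
    return torch.multinomial(probs, num_samples=k, replacement=True)  # (seq, k)

def rskd_loss(student_logits: torch.Tensor, sampled_ids: torch.Tensor):
    logq = student_logits.log_softmax(dim=-1)   # (seq, vocab)
    picked = logq.gather(-1, sampled_ids)       # (seq, k)
    return -picked.mean()                       # unbiased cross-entropy estimate

t = torch.randn(16, 32000)  # toy teacher logits for one sequence
s = torch.randn(16, 32000, requires_grad=True)
loss = rskd_loss(s, cache_teacher_samples(t))
loss.backward()
```

Since only k integer ids per position are cached instead of a full (or Top-K) probability vector, the stored logits are far sparser, which is where the storage saving comes from.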

Authors:Mengsong Wu, Tong Zhu, Han Han, Xiang Zhang, Wenbiao Shao, Wenliang Chen
Title: Chain-of-Tools: Utilizing Massive Unseen Tools in the CoT Reasoning of Frozen Language Models
Abstract:
Tool learning can further broaden the usage scenarios of large language models (LLMs). However, most existing methods either require finetuning, restricting the model to tools seen in the training data, or add tool demonstrations to the prompt at lower efficiency. In this paper, we present a new tool learning method, Chain-of-Tools. It makes full use of the powerful semantic representation capability of frozen LLMs to perform tool calling in CoT reasoning with a huge and flexible tool pool that may contain unseen tools. Especially, to validate the effectiveness of our approach in the massive unseen tool scenario, we construct a new dataset SimpleToolQuestions. We conduct experiments on two numerical reasoning benchmarks (GSM8K-XL and FuncQA) and two knowledge-based question answering benchmarks (KAMEL and SimpleToolQuestions). Experimental results show that our approach performs better than the baseline. We also identify dimensions of the model output that are critical in tool selection, enhancing the model interpretability. Our code and data are available at: https://github.com/fairyshine/Chain-of-Tools.
中文: Chain-of-Tools是一种新的工具学习方法,利用冻结大语言模型的强大语义表示能力,通过思维链推理处理海量可见及未见工具,在多个基准测试中优于基线方法并增强了模型可解释性。
English: Chain-of-Tools is a novel tool learning method that leverages frozen LLMs' semantic capabilities to handle both seen and unseen tools through CoT reasoning, outperforming baselines on multiple benchmarks while improving interpretability.

Authors:Massa Baali, Xiang Li, Hao Chen, Syed Abdul Hannan, Rita Singh, Bhiksha Raj
Title: CAARMA: Class Augmentation with Adversarial Mixup Regularization
Abstract:
Speaker verification is a typical zero-shot learning task, where inference of unseen classes is performed by comparing embeddings of test instances to known examples. The models performing inference must hence naturally generate embeddings that cluster same-class instances compactly, while maintaining separation across classes. In order to learn to do so, they are typically trained on a large number of classes (speakers), often using specialized losses. However, real-world speaker datasets often lack the class diversity needed to effectively learn this in a generalizable manner. We introduce CAARMA, a class augmentation framework that addresses this problem by generating synthetic classes through data mixing in the embedding space, expanding the number of training classes. To ensure the authenticity of the synthetic classes we adopt a novel adversarial refinement mechanism that minimizes categorical distinctions between synthetic and real classes. We evaluate CAARMA on multiple speaker verification tasks, as well as other representative zero-shot comparison-based speech analysis tasks and obtain consistent improvements: our framework demonstrates a significant improvement of 8% over all baseline models. The code is available at: https://github.com/massabaali7/CAARMA/
中文摘要:CAARMA是一个通过嵌入空间混合生成合成类别并采用对抗性优化确保真实性的类别增强框架,在说话人验证任务中相比基线模型实现了8%的性能提升。
English Summary: CAARMA is a class augmentation framework that enhances speaker verification by generating synthetic classes through embedding space mixing and adversarial refinement, achieving an 8% performance improvement over baseline models.
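
The class-augmentation half can be sketched as embedding-space mixup that mints fresh labels; the adversarial refinement that keeps synthetic classes realistic is omitted, and the Beta parameters are illustrative (same-speaker collisions are also ignored for brevity):

```python
import torch

def mixup_synthetic_classes(emb: torch.Tensor, n_new: int, n_real_classes: int):
    """emb: (n, d) utterance embeddings from real speakers.
    Returns synthetic embeddings with fresh class ids >= n_real_classes."""
    lam = torch.distributions.Beta(2.0, 2.0).sample((n_new, 1))
    i = torch.randint(0, len(emb), (n_new,))
    j = torch.randint(0, len(emb), (n_new,))
    synth = lam * emb[i] + (1 - lam) * emb[j]   # convex mix of two speakers
    synth_labels = torch.arange(n_real_classes, n_real_classes + n_new)
    return synth, synth_labels

emb = torch.randn(32, 192)  # toy speaker embeddings
synth, synth_y = mixup_synthetic_classes(emb, n_new=8, n_real_classes=8)
print(synth.shape, synth_y)  # 8 extra classes for the classification loss
```

Training the verification loss over real plus synthetic classes is what supplies the extra class diversity the abstract argues real datasets lack.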

Authors:Tianze Luo, Xingchen Miao, Wenbo Duan
Title: WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching
Abstract:
Flow matching offers a robust and stable approach to training diffusion models. However, directly applying flow matching to neural vocoders can result in subpar audio quality. In this work, we present WaveFM, a reparameterized flow matching model for mel-spectrogram conditioned speech synthesis, designed to enhance both sample quality and generation speed for diffusion vocoders. Since mel-spectrograms represent the energy distribution of waveforms, WaveFM adopts a mel-conditioned prior distribution instead of a standard Gaussian prior to minimize unnecessary transportation costs during synthesis. Moreover, while most diffusion vocoders rely on a single loss function, we argue that incorporating auxiliary losses, including a refined multi-resolution STFT loss, can further improve audio quality. To speed up inference without degrading sample quality significantly, we introduce a tailored consistency distillation method for WaveFM. Experiment results demonstrate that our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders, while enabling waveform generation in a single inference step.
中文: WaveFM通过采用梅尔谱条件先验和辅助损失来提升音频质量,并结合定制的一致性蒸馏方法,实现了在单步推理中快速生成高质量波形。
English: WaveFM enhances diffusion vocoders by using a mel-conditioned prior and auxiliary losses to improve audio quality and a tailored consistency distillation for faster, single-step waveform generation.

Authors:Xinyan Chen, Jiaxin Ge, Hongming Dai, Qiang Zhou, Qiuxuan Feng, Jingtong Hu, Yizhou Wang, Jiaming Liu, Shanghang Zhang
Title: EmpathyAgent: Can Embodied Agents Conduct Empathetic Actions?
Abstract:
Empathy is fundamental to human interactions, yet it remains unclear whether embodied agents can provide human-like empathetic support. Existing works have studied agents' task-solving and social-interaction abilities, but whether agents can understand empathetic needs and conduct empathetic behaviors remains overlooked. To address this, we introduce EmpathyAgent, the first benchmark to evaluate and enhance agents' empathetic actions across diverse scenarios. EmpathyAgent contains 10,000 multimodal samples with corresponding empathetic task plans and three different challenges. To systematically evaluate the agents' empathetic actions, we propose an empathy-specific evaluation suite that evaluates the agents' empathy process. We benchmark current models and find that exhibiting empathetic actions remains a significant challenge. Meanwhile, we train Llama3-8B using EmpathyAgent and find it can potentially enhance empathetic behavior. By establishing a standard benchmark for evaluating empathetic actions, we hope to advance research in empathetic embodied agents. Our code and data are publicly available at https://github.com/xinyan-cxy/EmpathyAgent.
中文: 该摘要介绍了EmpathyAgent,这是首个通过一万个多模态场景和专门评估套件来测评和增强具身智能体共情行为的基准,既揭示了现有模型的不足,也展示了通过Llama3-8B训练实现改进的潜力。
English: This abstract introduces EmpathyAgent, the first benchmark designed to evaluate and enhance empathetic behaviors in embodied agents through 10,000 multimodal scenarios and a specialized evaluation suite, revealing current models' limitations while demonstrating potential improvements via training with Llama3-8B.

Authors:Shuo Huang, Muhammad Umair Nasir, Steven James, Julian Togelius
Title: Word2Minecraft: Generating 3D Game Levels through Large Language Models
Abstract:
We present Word2Minecraft, a system that leverages large language models to generate playable game levels in Minecraft based on structured stories. The system transforms narrative elements, such as protagonist goals, antagonist challenges, and environmental settings, into game levels with both spatial and gameplay constraints. We introduce a flexible framework that allows for the customization of story complexity, enabling dynamic level generation. The system employs a scaling algorithm to maintain spatial consistency while adapting key game elements. We evaluate Word2Minecraft using both metric-based and human-based methods. Our results show that GPT-4-Turbo outperforms GPT-4o-Mini in most areas, including story coherence and objective enjoyment, while the latter excels in aesthetic appeal. We also demonstrate the system's ability to generate levels with high map enjoyment, offering a promising step forward in the intersection of story generation and game design. We open-source the code at https://github.com/JMZ-kk/Word2Minecraft/tree/word2mc_v0
中文摘要:Word2Minecraft是一个利用大语言模型将结构化故事转化为可定制复杂度的可玩《我的世界》关卡的系统,其中GPT-4-Turbo在故事连贯性和游戏目标趣味性方面表现更优,相关代码已开源发布。
English Summary: Word2Minecraft is a system that uses large language models to convert structured stories into playable Minecraft levels with customizable complexity, demonstrating superior performance with GPT-4-Turbo in story coherence and player enjoyment while making the code publicly available.

Authors:Tidiane Camaret Ndir, Robin Tibor Schirrmeister, Tonio Ball
Title: EEG-CLIP : Learning EEG representations from natural language descriptions
Abstract:
Deep networks for electroencephalogram (EEG) decoding are often only trained to solve one specific task, such as pathology or age decoding. A more general task-agnostic approach is to train deep networks to match a (clinical) EEG recording to its corresponding textual medical report and vice versa. This approach was pioneered in the computer vision domain matching images and their text captions, and subsequently enabled successful zero-shot decoding using textual class prompts. In this work, we follow this approach and develop a contrastive learning framework, EEG-CLIP, that aligns the EEG time series and the descriptions of the corresponding clinical text in a shared embedding space. We investigated its potential for versatile EEG decoding, evaluating performance in a range of few-shot and zero-shot settings. Overall, we show that EEG-CLIP manages to non-trivially align text and EEG representations. Our work presents a promising approach to learn general EEG representations, which could enable easier analyses of diverse decoding questions through zero-shot decoding or training task-specific models from fewer training examples. The code for reproducing our results is available at https://github.com/tidiane-camaret/EEGClip
中文: 本研究提出EEG-CLIP对比学习框架,通过将脑电图数据与临床文本映射到共享嵌入空间,实现了多功能的零样本和少样本解码,为更广泛的脑电分析应用开辟了新途径。
English: This study introduces EEG-CLIP, a contrastive learning framework that aligns EEG data with clinical text in a shared embedding space, enabling versatile zero-shot and few-shot decoding for broader EEG analysis applications.
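
At its core this is the symmetric InfoNCE objective from CLIP applied to (EEG, report) pairs; the encoders are stand-ins and only the contrastive loss is shown:

```python
import torch
import torch.nn.functional as F

def clip_loss(eeg_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    eeg = F.normalize(eeg_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = eeg @ txt.T / temperature   # (batch, batch) pairwise similarities
    targets = torch.arange(len(eeg))     # i-th recording matches i-th report
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

eeg_emb = torch.randn(8, 256, requires_grad=True)   # from an EEG encoder
text_emb = torch.randn(8, 256, requires_grad=True)  # from a text encoder
clip_loss(eeg_emb, text_emb).backward()
```

Once the two encoders share an embedding space, zero-shot decoding reduces to scoring a recording against textual class prompts such as "pathological EEG" versus "normal EEG".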

Authors:Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Limin Han, Jiaojiao Zhao, Junting Guo, Zhenhong Long, Shu Yang, Meijuan An, Beibei Huang, Rongjia Du, Ning Wang, Kai Wang, Shiguo Lian
Title: Safety Evaluation and Enhancement of DeepSeek Models in Chinese Contexts
Abstract:
DeepSeek-R1, renowned for its exceptional reasoning capabilities and open-source strategy, is significantly influencing the global artificial intelligence landscape. However, it exhibits notable safety shortcomings. Recent research conducted by Robust Intelligence, a subsidiary of Cisco, in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1 achieves a 100% attack success rate when processing harmful prompts. Furthermore, multiple security firms and research institutions have identified critical security vulnerabilities within the model. Although China Unicom has uncovered safety vulnerabilities of R1 in Chinese contexts, the safety capabilities of the remaining distilled models in the R1 series have not yet been comprehensively evaluated. To address this gap, this study utilizes the comprehensive Chinese safety benchmark CHiSafetyBench to conduct an in-depth safety evaluation of the DeepSeek-R1 series distilled models. The objective is to assess the safety capabilities of these models in Chinese contexts both before and after distillation, and to further elucidate the adverse effects of distillation on model safety. Building on these findings, we implement targeted safety enhancements for the entire DeepSeek-R1 model series. Evaluation results indicate that the enhanced models achieve significant improvements in safety while maintaining reasoning capabilities without notable degradation. We open-source the safety-enhanced models at https://github.com/UnicomAI/DeepSeek-R1-Safe to serve as a valuable resource for future research and optimization of DeepSeek models.
中文: DeepSeek-R1存在显著的安全缺陷,但本研究通过针对性增强在保持推理能力的同时大幅提升了其安全性,并将安全增强模型开源以供后续研究。
English: DeepSeek-R1 exhibits significant safety vulnerabilities, particularly with harmful prompts, but this study enhances its safety through targeted improvements while preserving reasoning capabilities, with the safety-enhanced models being open-sourced for further research.

Authors:Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, Song Han
Title: XAttention: Block Sparse Attention with Antidiagonal Scoring
Abstract:
Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle with balancing accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks, including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation, XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications. Code is available at https://github.com/mit-han-lab/x-attention.
中文: XAttention是一种即插即用框架,通过使用注意力矩阵中反对角线值之和作为块重要性的代理,显著加速长上下文Transformer推理,在保持接近全注意力精度的同时实现高达13.5倍的计算加速。
English: XAttention is a plug-and-play framework that accelerates long-context Transformer inference by using the sum of antidiagonal values in attention matrices as a proxy for block importance, achieving near-full accuracy with up to 13.5x computational speedup.
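
The antidiagonal proxy is easy to state in tensor form: reshape the score matrix into blocks, sum each block's antidiagonal, and keep the top-scoring blocks. Block size and keep ratio below are illustrative, and the paper's fused kernels are not reproduced:

```python
import torch

def antidiagonal_block_mask(attn: torch.Tensor, block: int = 8, keep_ratio: float = 0.3):
    """attn: (seq, seq) pre-softmax attention scores, seq divisible by block.
    Returns a (seq//block, seq//block) bool mask of blocks to keep."""
    n = attn.shape[0] // block
    blocks = attn.reshape(n, block, n, block).permute(0, 2, 1, 3)  # (n, n, b, b)
    # Flipping the last dim turns each block's antidiagonal into its diagonal.
    anti = torch.flip(blocks, dims=(-1,)).diagonal(dim1=-2, dim2=-1)
    scores = anti.sum(dim=-1)                  # (n, n) block-importance proxy
    k = max(1, int(keep_ratio * n * n))
    thresh = scores.flatten().topk(k).values.min()
    return scores >= thresh

attn = torch.randn(64, 64)
print(antidiagonal_block_mask(attn).float().mean())  # fraction of blocks kept
```

Because each antidiagonal crosses every row and column of its block exactly once, its sum samples the whole block at O(block) cost, which is why it is so much cheaper than scoring full blocks.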

Authors:Yang Sui, Yu-Neng Chuang, Guanchu Wang, Jiamu Zhang, Tianyi Zhang, Jiayi Yuan, Hongyi Liu, Andrew Wen, Shaochen Zhong, Na Zou, Hanjie Chen, Xia Hu
Title: Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have further improved performance in System-2 reasoning domains like mathematics and programming by harnessing supervised fine-tuning (SFT) and reinforcement learning (RL) techniques to enhance the Chain-of-Thought (CoT) reasoning. However, while longer CoT reasoning sequences improve performance, they also introduce significant computational overhead due to verbose and redundant outputs, known as the "overthinking phenomenon". In this paper, we provide the first structured survey to systematically investigate and explore the current progress toward achieving efficient reasoning in LLMs. Overall, relying on the inherent mechanism of LLMs, we categorize existing works into several key directions: (1) model-based efficient reasoning, which considers optimizing full-length reasoning models into more concise reasoning models or directly training efficient reasoning models; (2) reasoning output-based efficient reasoning, which aims to dynamically reduce reasoning steps and length during inference; (3) input prompts-based efficient reasoning, which seeks to enhance reasoning efficiency based on input prompt properties such as difficulty or length control. Additionally, we introduce the use of efficient data for training reasoning models, explore the reasoning capabilities of small language models, and discuss evaluation methods and benchmarking. Project website: https://github.com/Eclipsess/Awesome-Efficient-Reasoning-LLMs
中文: 本综述系统研究提升大语言模型推理效率的方法,通过将现有工作分类为模型优化、动态步骤缩减和提示增强三大方向,以解决冗长推理链导致的计算效率问题。
English: This survey systematically investigates methods to enhance reasoning efficiency in Large Language Models by categorizing approaches into model optimization, dynamic step reduction, and prompt-based strategies, while addressing computational inefficiencies from verbose reasoning chains.

Authors:Yifan Sun, Han Wang, Dongbai Li, Gang Wang, Huan Zhang
Title: The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Abstract:
Benchmark Data Contamination (BDC), the inclusion of benchmark testing samples in the training set, has raised increasing concerns in Large Language Model (LLM) evaluation, leading to falsely inflated performance estimates and undermining evaluation reliability. To address this, researchers have proposed various mitigation strategies to update existing benchmarks, including modifying original questions or generating new ones based on them. However, a rigorous examination of the effectiveness of these mitigation strategies remains lacking. In this paper, we design a systematic and controlled pipeline along with two novel metrics, fidelity and contamination resistance, to provide a fine-grained and comprehensive assessment of existing BDC mitigation strategies. Previous assessment methods, such as accuracy drop and accuracy matching, focus solely on aggregate accuracy, often leading to incomplete or misleading conclusions. Our metrics address this limitation by emphasizing question-level evaluation result matching. Extensive experiments with 10 LLMs, 5 benchmarks, 20 BDC mitigation strategies, and 2 contamination scenarios reveal that no existing strategy significantly improves resistance over the vanilla case (i.e., no benchmark update) across all benchmarks, and none effectively balances fidelity and contamination resistance. These findings underscore the urgent need for designing more effective BDC mitigation strategies. Our code repository is available at https://github.com/ASTRAL-Group/BDC_mitigation_assessment.
中文: 基准数据污染导致大语言模型评估性能虚高,尽管已有多种缓解策略,但我们的研究通过新指标发现这些策略均无法有效平衡保真度与抗污染能力,亟需开发更优方案。
English: Benchmark Data Contamination in LLM evaluation leads to inflated performance estimates, and while mitigation strategies exist, our study using novel metrics reveals that none effectively balance fidelity and contamination resistance, highlighting the need for better solutions.
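The question-level matching idea behind the two metrics can be sketched as follows; these are illustrative definitions on toy correctness vectors, and the paper's exact formulations may differ.

```python
import numpy as np

def match_rate(a: np.ndarray, b: np.ndarray) -> float:
    """Fraction of questions on which two evaluation runs agree
    (both correct or both incorrect)."""
    return float(np.mean(a == b))

# Hypothetical per-question correctness (1 = correct) over 8 questions.
clean_on_original = np.array([1, 1, 0, 1, 0, 1, 1, 0])  # clean model, original benchmark
clean_on_updated  = np.array([1, 1, 0, 1, 0, 1, 0, 0])  # clean model, updated benchmark
contam_on_updated = np.array([1, 1, 1, 1, 0, 1, 1, 1])  # contaminated model, updated benchmark

fidelity = match_rate(clean_on_original, clean_on_updated)     # does the update preserve clean results?
resistance = match_rate(clean_on_original, contam_on_updated)  # does it cancel contamination gains?
print(f"fidelity={fidelity:.2f} resistance={resistance:.2f}")
```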

Authors:Yunzhi Yao, Jizhan Fang, Jia-Chen Gu, Ningyu Zhang, Shumin Deng, Huajun Chen, Nanyun Peng
Title: CaKE: Circuit-aware Editing Enables Generalizable Knowledge Learners
Abstract:
Knowledge Editing (KE) enables the modification of outdated or incorrect information in large language models (LLMs). While existing KE methods can update isolated facts, they often fail to generalize these updates to multi-hop reasoning tasks that rely on the modified knowledge. Through an analysis of reasoning circuits, the neural pathways LLMs use for knowledge-based inference, we find that current layer-localized KE approaches (e.g., MEMIT, WISE), which edit only single or a few model layers, inadequately integrate updated knowledge into these reasoning pathways. To address this limitation, we present CaKE (Circuit-aware Knowledge Editing), a novel method that enhances the effective integration of updated knowledge in LLMs. By leveraging only a few curated data samples guided by our circuit-based analysis, CaKE stimulates the model to develop appropriate reasoning circuits for newly incorporated knowledge. Experiments show that CaKE enables more accurate and consistent use of edited knowledge across related reasoning tasks, achieving an average improvement of 20% in multi-hop reasoning accuracy on the MQuAKE dataset while requiring less memory than existing KE methods. We release the code and data at https://github.com/zjunlp/CaKE.
中文摘要:CaKE是一种新颖的知识编辑方法,通过基于推理电路的分析,有效提升大语言模型对新知识的整合能力,在多跳推理任务中实现更高的准确性和一致性。
English Summary: CaKE is a novel knowledge editing method that improves the integration of updated knowledge into large language models by leveraging circuit-based analysis, resulting in enhanced multi-hop reasoning accuracy and efficiency.

Authors:Zhaowei Liu, Xin Guo, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Zixuan Wang, Jiajie Xu, Weige Cai, Ziwei Yang, Xueqian Zhao, Chao Li, Sheng Xu, Dezhi Chen, Yun Chen, Zuo Bai, Liwen Zhang
Title: Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning
Abstract:
Reasoning large language models are rapidly evolving across various domains. However, their capabilities in handling complex financial tasks still require in-depth exploration. In this paper, we introduce Fin-R1, a reasoning large language model specifically designed for the financial sector. Fin-R1 is built using a two-stage architecture, leveraging a financial reasoning dataset distilled and processed based on DeepSeek-R1. Through supervised fine-tuning (SFT) and reinforcement learning (RL) training, it demonstrates performance close to DeepSeek-R1 with a parameter size of 7 billion across a range of financial reasoning tasks. It achieves state-of-the-art (SOTA) results on the FinQA and ConvFinQA tasks among the LLMs in our evaluation, and surpasses larger models on other tasks as well. Fin-R1 showcases strong reasoning and decision-making capabilities, providing solutions to various problems encountered in the financial domain. Our code is available at https://github.com/SUFE-AIFLM-Lab/Fin-R1.
中文:Fin-R1是一款专为金融领域设计的70亿参数推理大语言模型,通过两阶段训练方法在多项金融推理任务中实现了最先进的性能表现。
English: Fin-R1 is a specialized 7-billion-parameter reasoning large language model for the financial sector, achieving state-of-the-art performance in financial reasoning tasks through a two-stage training approach.

Authors:Quy-Anh Dang, Chris Ngo
Title: Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
Abstract:
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains (e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview) using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
中文: 本研究证明,强化学习能够以极低成本有效提升小型语言模型的推理能力,相比传统方法仅用少量资源就实现了显著的准确率提升。
English: This study demonstrates that reinforcement learning can efficiently enhance reasoning in small language models using minimal resources, achieving significant accuracy improvements at a fraction of the cost compared to conventional methods.

Authors:Qizhi Pei, Lijun Wu, Zhuoshi Pan, Yu Li, Honglin Lin, Chenlin Ming, Xin Gao, Conghui He, Rui Yan
Title: MathFusion: Enhancing Mathematical Problem-solving of LLM through Instruction Fusion
Abstract:
Large Language Models (LLMs) have shown impressive progress in mathematical reasoning. While data augmentation is promising to enhance mathematical problem-solving ability, current approaches are predominantly limited to instance-level modifications, such as rephrasing or generating syntactic variations, which fail to capture and leverage the intrinsic relational structures inherent in mathematical knowledge. Inspired by human learning processes, where mathematical proficiency develops through systematic exposure to interconnected concepts, we introduce MathFusion, a novel framework that enhances mathematical reasoning through cross-problem instruction synthesis. MathFusion implements this through three fusion strategies: (1) sequential fusion, which chains related problems to model solution dependencies; (2) parallel fusion, which combines analogous problems to reinforce conceptual understanding; and (3) conditional fusion, which creates context-aware selective problems to enhance reasoning flexibility. By applying these strategies, we generate a new dataset, MathFusionQA, followed by fine-tuning models (DeepSeekMath-7B, Mistral-7B, Llama3-8B) on it. Experimental results demonstrate that MathFusion achieves substantial improvements in mathematical reasoning while maintaining high data efficiency, boosting performance by 18.0 points in accuracy across diverse benchmarks while requiring only 45K additional synthetic instructions, representing a substantial improvement over traditional single-instruction approaches. Our datasets, models, and code are publicly available at https://github.com/QizhiPei/mathfusion.
Chinese: MathFusion提出了一种通过跨问题指令合成增强大语言模型数学推理能力的新框架,在保持高数据效率的同时显著提升了模型性能。
English: MathFusion introduces a novel framework that enhances mathematical reasoning in LLMs through cross-problem instruction synthesis, achieving significant performance gains while maintaining high data efficiency.

Authors:Mats Faulborn, Indira Sen, Max Pellert, Andreas Spitz, David Garcia
Title: Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models
Abstract:
Prompt-based language models like GPT4 and LLaMa have been used for a wide variety of use cases such as simulating agents, searching for information, or for content analysis. For all of these applications and others, political biases in these models can affect their performance. Several researchers have attempted to study political bias in language models using evaluation suites based on surveys, such as the Political Compass Test (PCT), often finding a particular leaning favored by these models. However, there is some variation in the exact prompting techniques, leading to diverging findings, and most research relies on constrained-answer settings to extract model responses. Moreover, the Political Compass Test is not a scientifically valid survey instrument. In this work, we contribute a political bias measure informed by political science theory, building on survey design principles to test a wide variety of input prompts, while taking into account prompt sensitivity. We then prompt 11 different open and commercial models, differentiating between instruction-tuned and non-instruction-tuned models, and automatically classify their political stances from 88,110 responses. Leveraging this dataset, we compute political bias profiles across different prompt variations and find that while the PCT exaggerates bias in certain models like GPT3.5, measures of political bias are often unstable, though generally more left-leaning for instruction-tuned models. Code and data are available on: https://github.com/MaFa211/theory_grounded_pol_bias
中文: 本研究提出了一种基于政治学理论的偏见测量方法,通过评估11种语言模型在不同提示下的表现,发现指令调优模型普遍存在左倾偏见,同时揭示了传统政治指南针测试在评估中的不稳定性与夸大倾向。
English: This study introduces a political science-informed bias measurement method that evaluates 11 language models across various prompts, revealing generally left-leaning biases in instruction-tuned models and highlighting the instability and exaggeration in traditional Political Compass Test assessments.

Authors:Zhiyu Cao, Peifeng Li, Yaxin Fan, Qiaoming Zhu
Title: Incomplete Utterance Rewriting with Editing Operation Guidance and Utterance Augmentation
Abstract:
Although existing popular generation methods for Incomplete Utterance Rewriting (IUR) can generate coherent utterances, they often include irrelevant and redundant tokens in rewritten utterances due to their inability to focus on critical tokens in the dialogue context. Furthermore, the limited size of the training datasets also contributes to the insufficient training of the IUR model. To address the first issue, we propose a multi-task learning framework EO-IUR (Editing Operation-guided Incomplete Utterance Rewriting) that introduces editing operation labels generated by a sequence labeling module to guide the generation model to focus on critical tokens. Furthermore, we introduce a token-level heterogeneous graph to represent dialogues. To address the second issue, we propose a two-dimensional utterance augmentation strategy, namely editing operation-based incomplete utterance augmentation and LLM-based historical utterance augmentation. The experimental results on three datasets demonstrate that our EO-IUR outperforms previous state-of-the-art (SOTA) baselines in both open-domain and task-oriented dialogue. The code will be available at https://github.com/Dewset/EO-IUR.
中文摘要:现有不完整话语改写方法因无法聚焦关键对话标记及训练数据有限,常生成冗余内容,而本文提出的EO-IUR框架通过引入编辑操作引导的多任务学习和二维话语增强策略,在多个数据集上实现了优于现有最优模型的性能表现。
English Summary: Current Incomplete Utterance Rewriting methods produce coherent but often redundant outputs due to insufficient focus on critical dialogue tokens and limited training data, which the proposed EO-IUR framework addresses through multi-task learning with editing operation guidance and a novel data augmentation strategy, outperforming state-of-the-art models across multiple datasets.

Authors:Zhihang Liu, Chen-Wei Xie, Pandeng Li, Liming Zhao, Longxiang Tang, Yun Zheng, Chuanbin Liu, Hongtao Xie
Title: Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models
Abstract:
Recent Multi-modal Large Language Models (MLLMs) have been challenged by the computational overhead resulting from massive video frames, often alleviated through compression strategies. However, visual content does not contribute equally to user instructions, and existing strategies (e.g., average pooling) inevitably lead to the loss of potentially useful information. To tackle this, we propose the Hybrid-level Instruction Injection Strategy for Conditional Token Compression in MLLMs (HICom), utilizing the instruction as a condition to guide compression at both the local and global levels. This encourages the compression to retain the maximum amount of user-focused information while reducing visual tokens to minimize the computational burden. Specifically, the instruction condition is injected into the grouped visual tokens at the local level and into the learnable tokens at the global level, and we conduct the attention mechanism to complete the conditional compression. Through the hybrid-level compression, the instruction-relevant visual parts are highlighted while the temporal-spatial structure is preserved for easier understanding by LLMs. To further unleash the potential of HICom, we introduce a new conditional pre-training stage with our proposed dataset HICom-248K. Experiments show that HICom achieves distinguished video understanding ability with fewer tokens, increasing performance by 2.43% on average across three multiple-choice QA benchmarks and saving 78.8% of tokens compared with the SOTA method. The code is available at https://github.com/lntzm/HICom.
Chinese Summary: 提出的HICom方法通过指令引导混合层级令牌压缩,在多模态大语言模型中优化视频压缩,聚焦用户相关信息以提升性能并显著降低计算负担。
English Summary: The proposed HICom method enhances video compression in Multi-modal Large Language Models by using instructions to guide hybrid-level token compression, improving performance and reducing computational load by focusing on user-relevant information.
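As a rough picture of instruction-conditioned compression, the sketch below pools a local group of visual tokens into a single token weighted by relevance to an instruction embedding; HICom's actual local/global injection and learnable global tokens are more involved.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def compress_group(visual: np.ndarray, instr: np.ndarray) -> np.ndarray:
    """Pool a group of visual tokens (g, d) into one token (d,),
    weighted by each token's relevance to the instruction embedding."""
    scores = visual @ instr / np.sqrt(visual.shape[-1])  # (g,) relevance logits
    return softmax(scores) @ visual                      # instruction-weighted pooling

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))  # 8 visual tokens in one local group
instr = rng.standard_normal(16)        # pooled instruction embedding
print(compress_group(tokens, instr).shape)  # one compressed token: (16,)
```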

Authors:Baolong Bi, Shenghua Liu, Yiwei Wang, Yilong Xu, Junfeng Fang, Lingrui Mei, Xueqi Cheng
Title: Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models
Abstract:
Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by integrating external knowledge. However, conflicts between parametric knowledge and retrieved context pose challenges, particularly when retrieved information is unreliable or the model's internal knowledge is outdated. In such cases, LLMs struggle to determine whether to rely more on their own parameters or the conflicted context. To address this, we propose CK-PLUG, a plug-and-play method for controlling LLMs' reliance on parametric and contextual knowledge. We introduce a novel knowledge consistency metric, Confidence Gain, which detects knowledge conflicts by measuring entropy shifts in token probability distributions after context insertion. CK-PLUG then enables fine-grained control over knowledge preference by adjusting the probability distribution of tokens with negative confidence gain through a single tuning parameter. Experiments demonstrate CK-PLUG's ability to significantly regulate knowledge reliance in counterfactual RAG scenarios while maintaining generation fluency and knowledge accuracy. For instance, on Llama3-8B, memory recall (MR) of RAG response can be adjusted within a broad range (9.9%-71.9%), compared to the baseline of 42.1%. Moreover, CK-PLUG supports adaptive control based on the model's confidence in both internal and external knowledge, achieving consistent performance improvements across various general RAG tasks. Our code is available at: https://github.com/byronBBL/CK-PLUG.
中文: 提出的CK-PLUG方法通过检测熵移的知识冲突并调整标记概率,动态控制大语言模型对参数化知识与上下文知识的依赖程度,在保持生成质量的同时实现了对知识偏好的有效调控,并在多种RAG场景中取得一致性能提升。
English: The proposed CK-PLUG method dynamically controls Large Language Models' reliance on parametric versus contextual knowledge by detecting knowledge conflicts through entropy shifts and adjusting token probabilities, achieving significant regulation of knowledge preference while maintaining generation quality across various RAG scenarios.
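A minimal sketch of the entropy-based control idea, assuming access to next-token distributions with and without the retrieved context; the blending rule below is a simplified stand-in for CK-PLUG's tuning-parameter adjustment.

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def blend(p_param: np.ndarray, p_ctx: np.ndarray, alpha: float) -> np.ndarray:
    """If inserting context lowered confidence (negative Confidence Gain),
    lean on the parametric distribution by weight alpha; otherwise trust context."""
    gain = entropy(p_param) - entropy(p_ctx)  # positive when context sharpens the distribution
    p = alpha * p_param + (1 - alpha) * p_ctx if gain < 0 else p_ctx
    return p / p.sum()

p_param = np.array([0.7, 0.2, 0.1])   # parametric belief
p_ctx = np.array([0.34, 0.33, 0.33])  # flatter after conflicting context
print(blend(p_param, p_ctx, alpha=0.8))
```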

Authors:DongGeon Lee, Ahjeong Park, Hyeri Lee, Hyeonseo Nam, Yunho Maeng
Title: Typed-RAG: Type-Aware Decomposition of Non-Factoid Questions for Retrieval-Augmented Generation
Abstract:
Addressing non-factoid question answering (NFQA) remains challenging due to its open-ended nature, diverse user intents, and need for multi-aspect reasoning. These characteristics often reveal the limitations of conventional retrieval-augmented generation (RAG) approaches. To overcome these challenges, we propose Typed-RAG, a framework for type-aware decomposition of non-factoid questions (NFQs) within the RAG paradigm. Specifically, Typed-RAG first classifies an NFQ into a predefined type (e.g., Debate, Experience, Comparison). It then decomposes the question into focused sub-queries, each targeting a single aspect. This decomposition enhances both retrieval relevance and answer quality. By combining the results of these sub-queries, Typed-RAG produces more informative and contextually aligned responses. Additionally, we construct Wiki-NFQA, a benchmark dataset for NFQA covering a wide range of NFQ types. Experiments show that Typed-RAG consistently outperforms existing QA approaches based on LLMs or RAG methods, validating the effectiveness of type-aware decomposition for improving both retrieval quality and answer generation in NFQA. Our code and dataset are available on https://github.com/TeamNLP/Typed-RAG.
Chinese: Typed-RAG框架通过将非事实性问题分类为预定义类型并分解为聚焦子查询,提升了检索相关性和答案质量,在Wiki-NFQA基准测试中的实验验证了其有效性。
English: The Typed-RAG framework enhances non-factoid question answering by classifying questions into predefined types and decomposing them into focused sub-queries, improving retrieval relevance and answer quality, as validated by experiments on the Wiki-NFQA benchmark.
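Schematically, the flow is classify, decompose, answer per aspect, then aggregate; the sketch below uses placeholder callables standing in for the LLM-backed components.

```python
from typing import Callable, List

def typed_rag(question: str,
              classify: Callable[[str], str],
              decompose: Callable[[str, str], List[str]],
              answer_sub: Callable[[str], str],
              aggregate: Callable[[str, List[str]], str]) -> str:
    qtype = classify(question)               # e.g. "Debate", "Experience", "Comparison"
    subqueries = decompose(question, qtype)  # one focused sub-query per aspect
    partials = [answer_sub(sq) for sq in subqueries]  # retrieve-then-answer each aspect
    return aggregate(question, partials)     # compose the final response
```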

Authors:Tsunehiko Tanaka, Edgar Simo-Serra
Title: Grammar and Gameplay-aligned RL for Game Description Generation with LLMs
Abstract:
Game Description Generation (GDG) is the task of generating a game description written in a Game Description Language (GDL) from natural language text. Previous studies have explored generation methods leveraging the contextual understanding capabilities of Large Language Models (LLMs); however, accurately reproducing the game features of the game descriptions remains a challenge. In this paper, we propose reinforcement learning-based fine-tuning of LLMs for GDG (RLGDG). Our training method simultaneously improves grammatical correctness and fidelity to game concepts by introducing both grammar rewards and concept rewards. Furthermore, we adopt a two-stage training strategy where Reinforcement Learning (RL) is applied following Supervised Fine-Tuning (SFT). Experimental results demonstrate that our proposed method significantly outperforms baseline methods using SFT alone. Our code is available at https://github.com/tsunehiko/rlgdg.
中文: 本文提出RLGDG方法,通过结合语法奖励和概念奖励的强化学习微调大语言模型,显著提升了游戏描述生成的准确性和概念保真度,效果优于单纯监督微调。
English: This paper introduces RLGDG, a reinforcement learning-based fine-tuning method for Large Language Models that enhances Game Description Generation by combining grammar and concept rewards, significantly outperforming supervised fine-tuning alone.

Authors:Federico Cocchi, Nicholas Moratelli, Davide Caffagni, Sara Sarto, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara
Title: LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Abstract:
Recent progress in Multimodal Large Language Models (MLLMs) has highlighted the critical roles of both the visual backbone and the underlying language model. While prior work has primarily focused on scaling these components to billions of parameters, the trade-offs between model size, architecture, and performance remain underexplored. Additionally, inconsistencies in training data and evaluation protocols have hindered direct comparisons, making it difficult to derive optimal design choices. In this paper, we introduce LLaVA-MORE, a new family of MLLMs that integrates recent language models with diverse visual backbones. To ensure fair comparisons, we employ a unified training protocol applied consistently across all architectures. Our analysis systematically explores both small- and medium-scale LLMs, including Phi-4, LLaMA-3.1, and Gemma-2, to evaluate multimodal reasoning, generation, and instruction following, while examining the relationship between model size and performance. Beyond evaluating the LLM impact on final results, we conduct a comprehensive study of various visual encoders, ranging from CLIP-based architectures to alternatives such as DINOv2, SigLIP, and SigLIP2. Additional experiments investigate the effects of increased image resolution and variations in pre-training datasets. Overall, our results provide insights into the design of more effective MLLMs, offering a reproducible evaluation framework that facilitates direct comparisons and can guide future model development. Our source code and trained models are publicly available at: https://github.com/aimagelab/LLaVA-MORE.
Chinese: 本文介绍了LLaVA-MORE系列多模态大语言模型,通过统一训练协议系统评估语言模型与视觉骨干网络的相互作用,为设计更有效的MLLM提供了见解。
English: The paper introduces LLaVA-MORE, a family of multimodal large language models that systematically evaluates the interplay between language models and visual backbones using a unified training protocol to provide insights for designing more effective MLLMs.

Authors:Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora
Title: What Makes a Reward Model a Good Teacher? An Optimization Perspective
Abstract:
The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model. However, while this quality is primarily evaluated through accuracy, it remains unclear whether accuracy fully captures what makes a reward model an effective teacher. We address this question from an optimization perspective. First, we prove that regardless of how accurate a reward model is, if it induces low reward variance, then the RLHF objective suffers from a flat landscape. Consequently, even a perfectly accurate reward model can lead to extremely slow optimization, underperforming less accurate models that induce higher reward variance. We additionally show that a reward model that works well for one language model can induce low reward variance, and thus a flat objective landscape, for another. These results establish a fundamental limitation of evaluating reward models solely based on accuracy or independently of the language model they guide. Experiments using models of up to 8B parameters corroborate our theory, demonstrating the interplay between reward variance, accuracy, and reward maximization rate. Overall, our findings highlight that beyond accuracy, a reward model needs to induce sufficient variance for efficient optimization.
Chinese: 基于人类反馈的强化学习(RLHF)的成功不仅取决于奖励模型的准确性,还要求其能产生足够的奖励方差,因为低方差会导致优化景观平坦和学习缓慢,即使模型准确性很高。
English: The effectiveness of Reinforcement Learning from Human Feedback (RLHF) relies not only on the accuracy of the reward model but also on its ability to induce sufficient reward variance, as low variance can lead to a flat optimization landscape and slow learning, even with high accuracy.
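The optimization argument can be seen in a four-candidate toy example: for a softmax policy with logits z, the gradient of the expected reward in coordinate j is p_j (r_j - E_p[r]), so a reward model that induces low reward variance produces a nearly flat objective even when its ranking (i.e., its accuracy) is perfect. This is a numeric illustration, not the paper's proof.

```python
import numpy as np

def expected_reward_grad(logits: np.ndarray, rewards: np.ndarray) -> np.ndarray:
    """Gradient of E_p[r] w.r.t. softmax logits: p * (r - E_p[r])."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p * (rewards - p @ rewards)

logits = np.zeros(4)                       # uniform initial policy over 4 responses
sharp = np.array([0.0, 0.1, 0.9, 1.0])     # high-variance rewards
flat = np.array([0.49, 0.50, 0.51, 0.52])  # low-variance rewards, identical ranking
print(np.linalg.norm(expected_reward_grad(logits, sharp)))  # ~0.23
print(np.linalg.norm(expected_reward_grad(logits, flat)))   # ~0.006: near-flat landscape
```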

Authors:Tongyao Zhu, Qian Liu, Haonan Wang, Shiqi Chen, Xiangming Gu, Tianyu Pang, Min-Yen Kan
Title: SkyLadder: Better and Faster Pretraining via Context Window Scheduling
Abstract:
Recent advancements in LLM pretraining have featured ever-expanding context windows to process longer sequences. However, our pilot study reveals that models pretrained with shorter context windows consistently outperform their long-context counterparts under a fixed token budget. This finding motivates us to explore an optimal context window scheduling strategy to better balance long-context capability with pretraining efficiency. To this end, we propose SkyLadder, a simple yet effective approach that implements a short-to-long context window transition. SkyLadder preserves strong standard benchmark performance, while matching or exceeding baseline results on long context tasks. Through extensive experiments, we pre-train 1B-parameter models (up to 32K context) and 3B-parameter models (8K context) on 100B tokens, demonstrating that SkyLadder yields consistent gains of up to 3.7% on common benchmarks, while achieving up to 22% faster training speeds compared to baselines. The code is at https://github.com/sail-sg/SkyLadder.
中文摘要:该研究提出SkyLadder方法,通过在预训练中采用从短到长的上下文窗口过渡策略,在保持强大长文本处理能力的同时,实现了基准测试性能提升最高达3.7%,训练速度比基线方法快22%。
English Summary: The study introduces SkyLadder, a method that transitions from short to long context windows during pretraining, achieving up to 3.7% better performance on benchmarks and 22% faster training than baselines while maintaining strong long-context capabilities.
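A context-window schedule of this kind reduces to a function from training step to sequence length; the linear ramp and endpoints below are illustrative, not SkyLadder's exact schedule.

```python
def context_window(step: int, total_steps: int,
                   start_len: int = 512, final_len: int = 32768) -> int:
    """Linearly ramp the pretraining context window from start_len to final_len."""
    frac = min(1.0, step / max(1, total_steps))
    return int(start_len + frac * (final_len - start_len))

for step in (0, 2500, 5000, 7500, 10000):
    print(step, context_window(step, total_steps=10000))
```

Batches would then be packed to the current window length, so early training sees many short sequences and only the final phase pays the quadratic attention cost of the full window.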

Authors:Yang Tan, Chen Liu, Jingyuan Gao, Banghao Wu, Mingchen Li, Ruilin Wang, Lingrong Zhang, Huiqun Yu, Guisheng Fan, Liang Hong, Bingxin Zhou
Title: VenusFactory: A Unified Platform for Protein Engineering Data Retrieval and Language Model Fine-Tuning
Abstract:
Natural language processing (NLP) has significantly influenced scientific domains beyond human language, including protein engineering, where pre-trained protein language models (PLMs) have demonstrated remarkable success. However, interdisciplinary adoption remains limited due to challenges in data collection, task benchmarking, and application. This work presents VenusFactory, a versatile engine that integrates biological data retrieval, standardized task benchmarking, and modular fine-tuning of PLMs. VenusFactory supports both computer science and biology communities with choices of both a command-line execution and a Gradio-based no-code interface, integrating 40+ protein-related datasets and 40+ popular PLMs. All implementations are open-sourced on https://github.com/tyang816/VenusFactory.
中文: VenusFactory是一个多功能引擎,通过整合生物数据检索、任务基准测试和模块化微调,解决了蛋白质语言建模中的跨学科挑战,并提供命令行和无代码界面及丰富的数据集与模型。
English: VenusFactory is a versatile engine that addresses interdisciplinary challenges in protein language modeling by integrating biological data retrieval, task benchmarking, and modular fine-tuning, offering both command-line and no-code interfaces with extensive datasets and models.

Authors:Junnan Zhu, Min Xiao, Yining Wang, Feifei Zhai, Yu Zhou, Chengqing Zong
Title: TROVE: A Challenge for Fine-Grained Text Provenance via Source Sentence Tracing and Relationship Classification
Abstract:
LLMs have achieved remarkable fluency and coherence in text generation, yet their widespread adoption has raised concerns about content reliability and accountability. In high-stakes domains, it is crucial to understand where and how the content is created. To address this, we introduce the Text pROVEnance (TROVE) challenge, designed to trace each sentence of a target text back to specific source sentences within potentially lengthy or multi-document inputs. Beyond identifying sources, TROVE annotates the fine-grained relationships (quotation, compression, inference, and others), providing a deep understanding of how each target sentence is formed. To benchmark TROVE, we construct our dataset by leveraging three public datasets covering 11 diverse scenarios (e.g., QA and summarization) in English and Chinese, spanning source texts of varying lengths (0-5k, 5-10k, 10k+), emphasizing the multi-document and long-document settings essential for provenance. To ensure high-quality data, we employ a three-stage annotation process: sentence retrieval, GPT-4o provenance, and human provenance. We evaluate 11 LLMs under direct prompting and retrieval-augmented paradigms, revealing that retrieval is essential for robust performance, larger models perform better in complex relationship classification, and closed-source models often lead, yet open-source models show significant promise, particularly with retrieval augmentation. We make our dataset available here: https://github.com/ZNLP/ZNLP-Dataset.
中文: TROVE挑战通过将目标句子溯源至具体来源并标注细粒度关系,评估表明检索增强和大模型在复杂场景中提升性能,同时开源模型展现出潜力。
English: The TROVE challenge is introduced to trace text provenance by linking target sentences to their sources and annotating fine-grained relationships, with evaluations showing retrieval augmentation and larger models enhance performance in complex scenarios.

Authors:David Wan, Justin Chih-Yao Chen, Elias Stengel-Eskin, Mohit Bansal
Title: MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration
Abstract:
Multi-agent collaboration among models has shown promise in reasoning tasks but is underexplored in long-form generation tasks like summarization and question-answering. We extend multi-agent multi-model reasoning to generation, specifically to improving faithfulness through refinement, i.e., revising model-generated outputs to remove factual inconsistencies. We investigate how iterative collaboration among multiple instances and types of large language models (LLMs) enhances subtasks in the refinement process, such as error detection, critiquing unfaithful sentences, and making corrections based on critiques. We design intrinsic evaluations for each subtask, with our findings indicating that both multi-agent (multiple instances) and multi-model (diverse LLM types) approaches benefit error detection and critiquing. Additionally, reframing critiquing and refinement as reranking rather than generation tasks improves multi-agent performance. We consolidate these insights into a final "recipe" called Multi-Agent Multi-Model Refinement (MAMM-Refine), where multi-agent and multi-model collaboration significantly boosts performance on three summarization datasets as well as on long-form question answering, demonstrating the effectiveness and generalizability of our recipe.
Chinese Summary: 多智能体多模型协作通过迭代检测和修正错误,显著提升了摘要和长问答等长文本生成任务中的忠实度,其中MAMM-Refine方法在多个数据集上验证了其有效性和泛化能力。
English Summary: Multi-agent multi-model collaboration enhances long-form generation tasks by refining outputs for improved faithfulness, with the MAMM-Refine method demonstrating significant performance gains in summarization and question-answering through iterative error detection and correction.
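The paper's reframing of critiquing as reranking can be sketched as a vote among heterogeneous agents; the callables below are placeholders for different LLM instances or types.

```python
from collections import Counter
from typing import Callable, List

def rerank_critiques(candidates: List[str],
                     agents: List[Callable[[List[str]], int]]) -> str:
    """Each agent votes for the index of the best candidate critique;
    the plurality winner is selected (reranking rather than generation)."""
    votes = Counter(agent(candidates) for agent in agents)
    return candidates[votes.most_common(1)[0][0]]

best = rerank_critiques(
    ["Sentence 2 contradicts the source.", "The summary looks fine."],
    agents=[lambda c: 0, lambda c: 0, lambda c: 1],  # toy voters
)
print(best)  # "Sentence 2 contradicts the source."
```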

Authors:Chentian Wei, Jiewei Chen, Jinzhu Xu
Title: Exploring Large Language Models for Word Games: Who is the Spy?
Abstract:
Word games hold significant research value for natural language processing (NLP), game theory, and related fields due to their rule-based and situational nature. This study explores how large language models (LLMs) can be effectively involved in word games and proposes a training-free framework. "Shei Shi Wo Di", or "Who is the Spy" in English, is a classic word game. Using this game as an example, we introduce a Chain-of-Thought (CoT)-based scheduling framework to enable LLMs to achieve excellent performance in tasks such as inferring role words and disguising their identities. We evaluate the framework's performance based on game success rates and the accuracy of the LLM agents' analytical results. Experimental results affirm the framework's effectiveness, demonstrating notable improvements in LLM performance across multiple datasets. This work highlights the potential of LLMs in mastering situational reasoning and social interactions within structured game environments. Our code is publicly available at https://github.com/ct-wei/Who-is-The-Spy.
中文: 本研究提出了一种无需训练的思维链调度框架,使大语言模型在词语游戏“谁是卧底”中表现出色,有效提升了情境推理和社交互动能力。
English: This study introduces a training-free framework using Chain-of-Thought reasoning to enable large language models to excel in the word game "Who is the Spy," demonstrating improved performance in situational reasoning and social interaction tasks.

Authors:Àlex Pujol Vidal, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund
Title: Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU
Abstract:
Machine unlearning methods have become increasingly important for selective concept removal in large pre-trained models. While recent work has explored unlearning in Euclidean contrastive vision-language models, the effectiveness of concept removal in hyperbolic spaces remains unexplored. This paper investigates machine unlearning in hyperbolic contrastive learning by adapting Alignment Calibration to MERU, a model that embeds images and text in hyperbolic space to better capture semantic hierarchies. Through systematic experiments and ablation studies, we demonstrate that hyperbolic geometry offers distinct advantages for concept removal, achieving near-perfect forgetting with reasonable performance on retained concepts, particularly when scaling to multiple concept removal. Our approach introduces hyperbolic-specific components including entailment calibration and norm regularization that leverage the unique properties of hyperbolic space. Comparative analysis with Euclidean models reveals fundamental differences in unlearning dynamics, with hyperbolic unlearning reorganizing the semantic hierarchy while Euclidean approaches merely disconnect cross-modal associations. These findings not only advance machine unlearning techniques but also provide insights into the geometric properties that influence concept representation and removal in multimodal models. Source code is available at https://github.com/alex-pv01/HAC.
中文: 本文提出一种针对双曲对比学习模型的机器遗忘方法,通过双曲空间特有的技术重组语义层次结构,在保持保留概念性能的同时实现了更优越的概念消除效果。
English: This paper introduces a machine unlearning method for hyperbolic contrastive learning models, demonstrating superior concept removal through hyperbolic-specific techniques that reorganize semantic hierarchies while maintaining performance on retained concepts.

Authors:Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, Dacheng Tao
Title: Towards Understanding the Safety Boundaries of DeepSeek Models: Evaluation and Findings
Abstract:
This study presents the first comprehensive safety evaluation of the DeepSeek models, focusing on evaluating the safety risks associated with their generated content. Our evaluation encompasses DeepSeek's latest generation of large language models, multimodal large language models, and text-to-image models, systematically examining their performance regarding unsafe content generation. Notably, we developed a bilingual (Chinese-English) safety evaluation dataset tailored to Chinese sociocultural contexts, enabling a more thorough evaluation of the safety capabilities of Chinese-developed models. Experimental results indicate that despite their strong general capabilities, DeepSeek models exhibit significant safety vulnerabilities across multiple risk dimensions, including algorithmic discrimination and sexual content. These findings provide crucial insights for understanding and improving the safety of large foundation models. Our code is available at https://github.com/NY1024/DeepSeek-Safety-Eval.
中文总结:本研究首次对DeepSeek模型进行全面安全评估,发现尽管具备强大通用能力,这些模型在多个风险维度仍存在显著安全漏洞。
English Summary: This study conducts the first comprehensive safety evaluation of DeepSeek models, revealing significant safety vulnerabilities across multiple risk dimensions despite their strong general capabilities.

Authors:Haoyi Li, Angela Yifei Yuan, Soyeon Caren Han, Christopher Leckie
Title: SPADE: Structured Prompting Augmentation for Dialogue Enhancement in Machine-Generated Text Detection
Abstract:
The increasing capability of large language models (LLMs) to generate synthetic content has heightened concerns about their misuse, driving the development of Machine-Generated Text (MGT) detection models. However, these detectors face significant challenges due to the lack of high-quality synthetic datasets for training. To address this issue, we propose SPADE, a structured framework for detecting synthetic dialogues using prompt-based positive and negative samples. Our proposed methods yield 14 new dialogue datasets, which we benchmark against eight MGT detection models. The results demonstrate improved generalization performance when utilizing a mixed dataset produced by proposed augmentation frameworks, offering a practical approach to enhancing LLM application security. Considering that real-world agents lack knowledge of future opponent utterances, we simulate online dialogue detection and examine the relationship between chat history length and detection accuracy. Our open-source datasets, code and prompts can be downloaded from https://github.com/AngieYYF/SPADE-customer-service-dialogue.
中文:SPADE框架通过基于提示的正负样本提出结构化合成对话检测方法,构建的新数据集提升了检测模型的泛化能力,为大型语言模型应用安全提供了实用解决方案。
English: The SPADE framework introduces a structured approach using prompt-based samples to enhance synthetic dialogue detection, creating new datasets that improve model generalization and security for LLM applications.

Authors:Honglin Lin, Zhuoshi Pan, Yu Li, Qizhi Pei, Xin Gao, Mengzhang Cai, Conghui He, Lijun Wu
Title: MetaLadder: Ascending Mathematical Solution Quality via Analogical-Problem Reasoning Transfer
Abstract:
Large Language Models (LLMs) have demonstrated promising capabilities in solving mathematical reasoning tasks, leveraging Chain-of-Thought (CoT) data as a vital component in guiding answer generation. Current paradigms typically generate CoT and answers directly for a given problem, diverging from human problem-solving strategies to some extent. Humans often solve problems by recalling analogous cases and leveraging their solutions to reason about the current task. Inspired by this cognitive process, we propose MetaLadder, a novel framework that explicitly prompts LLMs to recall and reflect on meta-problems, those structurally or semantically analogous problems, alongside their CoT solutions before addressing the target problem. Additionally, we introduce a problem-restating mechanism to enhance the model's comprehension of the target problem by regenerating the original question, which further improves reasoning accuracy. Therefore, the model can achieve reasoning transfer from analogical problems, mimicking human-like "learning from examples" and generalization abilities. Extensive experiments on mathematical benchmarks demonstrate that our MetaLadder significantly boosts LLMs' problem-solving accuracy, largely outperforming standard CoT-based methods (10.3% accuracy gain) and other methods. Our code and data have been released at https://github.com/LHL3341/MetaLadder.
中文摘要:MetaLadder框架通过让大语言模型在解决问题前先回忆类似问题及其推理过程,并采用问题重述机制增强理解,实现了类比推理迁移,在数学基准测试中比标准方法显著提升10.3%的准确率。
English Summary: The proposed MetaLadder framework enhances LLMs' mathematical reasoning by prompting them to recall analogous problems and their solutions before addressing target tasks, achieving a 10.3% accuracy improvement over standard methods through problem-restating and analogical reasoning.
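A hypothetical prompt template in the spirit of MetaLadder (the released code may format these fields differently): recall an analogous meta-problem with its chain of thought, ask for a restatement of the target, then solve.

```python
def metaladder_prompt(target: str, meta_problem: str, meta_cot: str) -> str:
    return (
        "Analogous problem:\n" + meta_problem + "\n"
        "Its reasoning:\n" + meta_cot + "\n"
        "Restate the target problem in your own words, then solve it step by step.\n"
        "Target problem:\n" + target + "\n"
    )

print(metaladder_prompt(
    target="A train travels 120 km in 1.5 h. What is its average speed?",
    meta_problem="A cyclist rides 45 km in 3 h. What is the average speed?",
    meta_cot="Average speed = distance / time = 45 / 3 = 15 km/h.",
))
```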

Authors:Chejian Xu, Jiawei Zhang, Zhaorun Chen, Chulin Xie, Mintong Kang, Yujin Potter, Zhun Wang, Zhuowen Yuan, Alexander Xiong, Zidi Xiong, Chenhui Zhang, Lingzhi Yuan, Yi Zeng, Peiyang Xu, Chengquan Guo, Andy Zhou, Jeffrey Ziwei Tan, Xuandong Zhao, Francesco Pinto, Zhen Xiang, Yu Gai, Zinan Lin, Dan Hendrycks, Bo Li, Dawn Song
Title: MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models
Abstract:
Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.
中文: 本文提出了首个统一平台MMDT,旨在从安全性、幻觉、公平性等多维度全面评估多模态基础模型的安全可信度,揭示了模型漏洞并为开发更可靠的系统铺平了道路。
English: This paper introduces MMDT, the first unified platform for comprehensively evaluating the safety and trustworthiness of multimodal foundation models across multiple dimensions such as safety, hallucination, and fairness, revealing vulnerabilities and paving the way for more reliable systems.

Authors:Yicheng Fu, Zikui Wang, Liuxin Yang, Meiqing Huo, Zhongdongming Dai
Title: ConQuer: A Framework for Concept-Based Quiz Generation
Abstract:
Quizzes play a crucial role in education by reinforcing students' understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and require deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experiment results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework. Code available at https://github.com/sofyc/ConQuer.
中文: ConQuer是一个基于概念的测验生成框架,通过整合外部知识提升AI生成测验的质量,在评估分数和成对比较中均取得显著提升。
English: ConQuer is a concept-based quiz generation framework that enhances AI-generated quiz quality by integrating external knowledge, achieving significant improvements in evaluation scores and pairwise comparisons.

Authors:Sara Sarto, Marcella Cornia, Rita Cucchiara
Title: Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
Abstract:
The evaluation of machine-generated image captions is a complex and evolving challenge. With the advent of Multimodal Large Language Models (MLLMs), image captioning has become a core task, increasing the need for robust and reliable evaluation metrics. This survey provides a comprehensive overview of advancements in image captioning evaluation, analyzing the evolution, strengths, and limitations of existing metrics. We assess these metrics across multiple dimensions, including correlation with human judgment, ranking accuracy, and sensitivity to hallucinations. Additionally, we explore the challenges posed by the longer and more detailed captions generated by MLLMs and examine the adaptability of current metrics to these stylistic variations. Our analysis highlights some limitations of standard evaluation approaches and suggests promising directions for future research in image captioning assessment.
中文: 本文综述了图像描述评估指标的发展历程与局限性,重点分析了多模态大语言模型生成详细描述带来的挑战,并提出了未来研究方向。
English: This survey comprehensively reviews the evolution and limitations of image captioning evaluation metrics, emphasizing the challenges posed by MLLMs' detailed outputs and proposing future research directions.

Authors:Jiacheng Guo, Yue Wu, Jiahao Qiu, Kaixuan Huang, Xinzhe Juan, Ling Yang, Mengdi Wang
Title: Temporal Consistency for LLM Reasoning Process Error Identification
Abstract:
Verification is crucial for effective mathematical reasoning. We present a new temporal consistency method where verifiers iteratively refine their judgments based on the previous assessment. Unlike one-round verification or multi-model debate approaches, our method leverages consistency in a sequence of self-reflection actions to improve verification accuracy. Empirical evaluations across diverse mathematical process error identification benchmarks (Mathcheck, ProcessBench, and PRM800K) show consistent performance improvements over baseline methods. When applied to the recent DeepSeek R1 distilled models, our method demonstrates strong performance, enabling 7B/8B distilled models to outperform all 70B/72B models and GPT-4o on ProcessBench. Notably, the distilled 14B model with our method achieves performance comparable to Deepseek-R1. Our code is available at https://github.com/jcguo123/Temporal-Consistency.
中文: 本文提出了一种时序一致性方法,通过迭代性自我反思提升数学推理验证效果,在多个基准测试中表现优异,使小型蒸馏模型性能超越包括GPT-4o在内的大型模型。
English: This paper introduces a temporal consistency method that enhances mathematical reasoning verification through iterative self-reflection, achieving superior performance on multiple benchmarks and enabling smaller distilled models to outperform larger ones, including GPT-4o.
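The core loop can be sketched as iterating a verifier on its own previous assessment until two consecutive rounds agree; `verify` is a placeholder for an LLM call, and the stopping rule is a simplification.

```python
from typing import Callable, List, Optional

def temporally_consistent_verdict(
    steps: List[str],
    verify: Callable[[List[str], Optional[List[bool]]], List[bool]],
    max_rounds: int = 5,
) -> List[bool]:
    prev: Optional[List[bool]] = None
    for _ in range(max_rounds):
        cur = verify(steps, prev)  # verifier reflects on its previous judgment
        if cur == prev:            # judgments have stabilized: accept them
            return cur
        prev = cur
    return prev                    # fall back to the last assessment

def toy_verify(steps, prev):
    # Toy stand-in: deterministically flags the step containing "=> 13".
    return ["=> 13" not in s for s in steps]

print(temporally_consistent_verdict(["2+2 => 4", "4*3 => 13"], toy_verify))
```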

Authors:Fardin Saad, Pradeep K. Murukannaiah, Munindar P. Singh
Title: Gricean Norms as a Basis for Effective Collaboration
Abstract:
Effective human-AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks -- common ground, relevance theory, and theory of mind -- into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions, which are: ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents.
中文: 有效的人机协作需要AI代理运用格莱斯会话准则处理模糊指令,配备此规范的Lamoid代理在合作任务中展现出更高的准确性和语境适应性。
English: Effective human-AI collaboration requires AI agents to handle unclear instructions using Gricean conversational norms, as demonstrated by the improved performance of Lamoid agents equipped with these principles in collaborative tasks.

Authors:Bo Peng, Ruichong Zhang, Daniel Goldstein, Eric Alcaide, Xingjian Du, Haowen Hou, Jiaju Lin, Jiaxing Liu, Janna Lu, William Merrill, Guangyu Song, Kaifeng Tan, Saiteja Utpala, Nathan Wilce, Johan S. Wind, Tianyi Wu, Daniel Wuttke, Christian Zhou-Zheng
Title: RWKV-7 "Goose" with Expressive Dynamic State Evolution
Abstract:
We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token. Despite being trained on dramatically fewer tokens than other top models, our 2.9 billion parameter language model achieves a new 3B SoTA on multilingual tasks and matches the current 3B SoTA on English language downstream performance. RWKV-7 introduces a newly generalized formulation of the delta rule with vector-valued gating and in-context learning rates, as well as a relaxed value replacement rule. We show that RWKV-7 can perform state tracking and recognize all regular languages, while retaining parallelizability of training. This exceeds the capabilities of Transformers under standard complexity conjectures, which are limited to TC^0. To demonstrate RWKV-7's language modeling capability, we also present an extended open source 3.1 trillion token multilingual corpus, and train four RWKV-7 models ranging from 0.19 billion to 2.9 billion parameters on this dataset. To foster openness, reproduction, and adoption, we release our models and dataset component listing at https://huggingface.co/RWKV, and our training and inference code at https://github.com/RWKV/RWKV-LM all under the Apache 2.0 License.
中文: RWKV-7 "Goose" 是一种新型序列建模架构,尽管训练数据量较少,却能在多语言任务中达到顶尖性能,同时保持恒定内存和推理时间,并能执行状态跟踪和识别所有正则语言,超越了Transformer的固有局限。
English: RWKV-7 "Goose" is a novel sequence modeling architecture that achieves state-of-the-art performance in multilingual tasks with constant memory and inference time, despite training on fewer tokens, and demonstrates capabilities beyond Transformers by performing state tracking and recognizing all regular languages.
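As a rough intuition for the state update, the sketch below implements a classical delta rule with a vector-valued decay gate and a scalar in-context learning rate; this is a simplified stand-in, not RWKV-7's exact generalized formulation.

```python
import numpy as np

def delta_rule_step(S: np.ndarray, k: np.ndarray, v: np.ndarray,
                    w: np.ndarray, beta: float) -> np.ndarray:
    """S: (d_v, d_k) matrix state; k: key; v: value;
    w: per-channel decay in (0, 1); beta: in-context learning rate."""
    k = k / (np.linalg.norm(k) + 1e-8)
    S = S * w                                # vector-valued gating per key channel
    pred = S @ k                             # state's current prediction for this key
    return S + beta * np.outer(v - pred, k)  # delta rule: correct the state toward v

rng = np.random.default_rng(1)
S = np.zeros((4, 4))
for _ in range(3):
    S = delta_rule_step(S, rng.standard_normal(4), rng.standard_normal(4),
                        w=np.full(4, 0.97), beta=0.5)
print(S.round(2))
```

Because each token applies only this constant-size update, memory and per-token inference cost stay constant regardless of sequence length.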

Authors:Vlad Hondru, Eduard Hogea, Darian Onchis, Radu Tudor Ionescu
Title: ExDDV: A New Dataset for Explainable Deepfake Detection in Video
Abstract:
The ever-growing realism and quality of generated videos make it increasingly hard for humans to spot deepfake content, forcing them to rely more and more on automatic deepfake detectors. However, deepfake detectors are also prone to errors, and their decisions are not explainable, leaving humans vulnerable to deepfake-based fraud and misinformation. To this end, we introduce ExDDV, the first dataset and benchmark for Explainable Deepfake Detection in Video. ExDDV comprises around 5.4K real and deepfake videos that are manually annotated with text descriptions (to explain the artifacts) and clicks (to point out the artifacts). We evaluate a number of vision-language models on ExDDV, performing experiments with various fine-tuning and in-context learning strategies. Our results show that text and click supervision are both required to develop robust explainable models for deepfake videos, which are able to localize and describe the observed artifacts. Our novel dataset and code to reproduce the results are available at https://github.com/vladhondru25/ExDDV.
Chinese: 随着深度伪造视频的真实感日益增强,人类和自动化检测系统都面临挑战,为此我们推出了首个可解释深度伪造检测数据集ExDDV,通过文本和点击标注提升模型对伪造痕迹的定位与描述能力,确保检测结果既可靠又可解释。
English: The increasing realism of deepfake videos challenges both human detection and automated systems, which often lack explainability, prompting the introduction of ExDDV—the first dataset and benchmark for explainable deepfake detection, using text and click annotations to enhance model robustness and localization of artifacts.

Authors:Chenxiao Yang, Nathan Srebro, David McAllester, Zhiyuan Li
Title: PENCIL: Long Thoughts with Short Memory
Abstract:
While state-of-the-art LLMs have demonstrated great promise of using long Chains-of-Thought (CoT) to boost reasoning, scaling it up to more challenging problems at test-time is fundamentally limited by suboptimal memory usage: intermediate computations accumulate indefinitely in context even when no longer needed for future thoughts. We introduce PENCIL, which incorporates a novel reduction mechanism into the autoregressive generation process that recursively cleans up intermediate thoughts based on patterns learned from training. By iteratively generating and erasing thoughts, PENCIL can think deeper to solve harder problems using shorter context and less compute. Empirically, we observe PENCIL is significantly more effective and efficient than CoT. For example, we demonstrate PENCIL with a small 25M-parameter transformer and 2048 context length solves Einstein's puzzle, a task that challenges much larger models like GPT-4. Theoretically, we prove PENCIL can perform universal efficient computation by simulating any Turing machine with optimal time and space complexity, and thus can solve arbitrary computable tasks that are otherwise intractable for vanilla CoT.
中文: PENCIL通过引入一种新颖的消减机制,在推理过程中递归清理中间思考,从而能以更短的上下文和更少的计算资源解决比传统思维链方法更复杂的问题。
English: PENCIL introduces a novel reduction mechanism that recursively cleans up intermediate thoughts during reasoning, enabling deeper problem-solving with shorter context and less computation than traditional Chain-of-Thought methods.
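A toy sketch of the erase step: when a marker closes a scratch computation, everything between the opening marker and the result is dropped from the context. The [CALL]/[RETURN] tokens here are invented for illustration; PENCIL learns its reduction patterns from training.

```python
from typing import List

def reduce_context(tokens: List[str]) -> List[str]:
    """Collapse each [CALL] ... [RETURN] span, keeping only its final result."""
    out: List[str] = []
    for tok in tokens:
        out.append(tok)
        if tok == "[RETURN]":
            start = len(out) - 1 - out[::-1].index("[CALL]")  # matching [CALL]
            result = out[start + 1:-1][-1:]  # keep only the last intermediate token
            out = out[:start] + result       # erase the rest of the scratch work
    return out

trace = ["solve", "[CALL]", "2+2=4", "4*3=12", "[RETURN]", "answer:12"]
print(reduce_context(trace))  # ['solve', '4*3=12', 'answer:12']
```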

Authors:Weihang Su, Baoqing Yue, Qingyao Ai, Yiran Hu, Jiaqi Li, Changyue Wang, Kaiyuan Zhang, Yueyue Wu, Yiqun Liu
Title: JuDGE: Benchmarking Judgment Document Generation for Chinese Legal System
Abstract:
This paper introduces JuDGE (Judgment Document Generation Evaluation), a novel benchmark for evaluating the performance of judgment document generation in the Chinese legal system. We define the task as generating a complete legal judgment document from the given factual description of the case. To facilitate this benchmark, we construct a comprehensive dataset consisting of factual descriptions from real legal cases, paired with their corresponding full judgment documents, which serve as the ground truth for evaluating the quality of generated documents. This dataset is further augmented by two external legal corpora that provide additional legal knowledge for the task: one comprising statutes and regulations, and the other consisting of a large collection of past judgment documents. In collaboration with legal professionals, we establish a comprehensive automated evaluation framework to assess the quality of generated judgment documents across various dimensions. We evaluate various baseline approaches, including few-shot in-context learning, fine-tuning, and a multi-source retrieval-augmented generation (RAG) approach, using both general and legal-domain LLMs. The experimental results demonstrate that, while RAG approaches can effectively improve performance in this task, there is still substantial room for further improvement. All the codes and datasets are available at: https://github.com/oneal2000/JuDGE.
中文摘要:本文提出JuDGE这一中国法律判决文书生成评估新基准,通过包含真实案例的完整数据集和自动化评估框架证明检索增强方法能有效提升生成质量,但仍需进一步改进。
English Summary: This paper presents JuDGE, a new benchmark for evaluating judgment document generation in Chinese law, featuring a comprehensive dataset and automated evaluation framework that shows retrieval-augmented methods improve performance but require further development.

Authors:Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, Liqiang Nie
Title: Towards Harmless Multimodal Assistants with Blind Preference Optimization
Abstract:
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (by 14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at https://lu-yang666.github.io/MMsafe-PO-Web/.
中文摘要:本研究提出MMSafe-PO数据集和盲偏好优化方法,显著提升多模态大语言模型的安全性能,在基准测试中实现45%的安全率提升并展现卓越鲁棒性。
English Summary: The study introduces the MMSafe-PO dataset and Blind Preference Optimization (BPO) method to enhance multimodal large language models' safety, achieving a 45% safety improvement and demonstrating robustness across benchmarks.

Authors:Mykyta Syromiatnikov, Victoria Ruvinskaya, Nataliia Komleva
Title: Empowering Smaller Models: Tuning LLaMA and Gemma with Chain-of-Thought for Ukrainian Exam Tasks
Abstract:
Leading large language models have demonstrated impressive capabilities in reasoning-intensive tasks, such as standardized educational testing. However, they often require extensive computational infrastructure that is inaccessible in low-resource settings. Small or compact models, though more efficient, frequently lack sufficient support for underrepresented languages, leaving a performance gap in critical domains. This work explores the potential of parameter-efficient fine-tuning of compact open-weight language models to handle reasoning-intensive tasks in the underrepresented Ukrainian language, building on the findings of the ZNO-Eval benchmark. Parameter-efficient fine-tuning of LLaMA 3.1 (8 billion parameters), LLaMA 3.2 (3 billion parameters), and Gemma 2 (9 billion parameters) models on chain-of-thought solutions resulted in a modest test score improvement of up to 17.4% on complex matching tasks and 1.6% overall compared to tuning on answer letters alone, offering enhanced interpretability and robustness. In addition, the proposed tuning method with joint task topic and step-by-step solution generation outperforms standard chain-of-thought tuning in matching tasks and provides a 5.4% gain over the best LLaMA 3.2 model due to guiding the model to recall and apply domain-relevant information. Contrasting the obtained results with zero-shot evaluations of leading open-weight and proprietary models such as Qwen, DeepSeek R1, OpenAI o1 and o3, Gemini, and Claude highlights that fine-tuning LLaMA and Gemma models with 2,032 step-by-step solutions and 20 to 50 million trainable parameters on a single A100 GPU lets them outperform GPT-4o mini, Mistral Large, and larger open-weight models. This research also evaluates how merging the quantized adapter with the base model influences the generation quality. Source code and tuned models are available at https://github.com/NLPForUA/ZNO.
中文: 本研究证明,通过对LLaMA和Gemma等紧凑型语言模型进行参数高效微调,能显著提升其在乌克兰语推理任务中的表现,使其在保持计算效率的同时超越GPT-4o mini和Mistral Large等更大模型。
English: This study demonstrates that parameter-efficient fine-tuning of compact language models like LLaMA and Gemma significantly enhances their performance on reasoning tasks in Ukrainian, enabling them to surpass larger models including GPT-4o mini and Mistral Large while maintaining computational efficiency.
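
As a concrete illustration of the parameter-efficient setup described above, the sketch below attaches LoRA adapters to a causal LM with the HuggingFace peft library. This is a minimal sketch: the checkpoint id, rank, and target modules are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal LoRA fine-tuning setup (assumed hyperparameters, illustrative only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-3B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Low-rank adapters on the attention projections keep the trainable parameter
# count in the tens of millions, the regime mentioned in the abstract.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Training would then proceed on chain-of-thought solution texts rather than bare answer letters, which is the comparison the abstract draws.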

Authors:Xiaoying Xing, Chia-Wen Kuo, Li Fuxin, Yulei Niu, Fan Chen, Ming Li, Ying Wu, Longyin Wen, Sijie Zhu
Title: Where do Large Vision-Language Models Look at when Answering Questions?
Abstract:
Large Vision-Language Models (LVLMs) have shown promising performance in vision-language understanding and reasoning tasks. However, their visual understanding behaviors remain underexplored. A fundamental question arises: to what extent do LVLMs rely on visual input, and which image regions contribute to their responses? It is non-trivial to interpret the free-form generation of LVLMs due to their complicated visual architecture (e.g., multiple encoders and multi-resolution) and variable-length outputs. In this paper, we extend existing heatmap visualization methods (e.g., iGOS++) to support LVLMs for open-ended visual question answering. We propose a method to select visually relevant tokens that reflect the relevance between generated answers and input image. Furthermore, we conduct a comprehensive analysis of state-of-the-art LVLMs on benchmarks designed to require visual information to answer. Our findings offer several insights into LVLM behavior, including the relationship between focus region and answer correctness, differences in visual attention across architectures, and the impact of LLM scale on visual understanding. The code and data are available at https://github.com/bytedance/LVLM_Interpretation.
Chinese: 本研究扩展了热力图可视化方法,以解释大型视觉语言模型在开放式问答中如何利用视觉输入,揭示了其关注区域、架构差异及语言模型规模对视觉理解影响的重要发现。
English: This study extends heatmap visualization methods to interpret how Large Vision-Language Models (LVLMs) utilize visual inputs for open-ended question answering, revealing key insights into their focus regions, architectural differences, and the impact of language model scale on visual understanding.

Authors:Pingyu Wu, Daiheng Gao, Jing Tang, Huimin Chen, Wenbo Zhou, Weiming Zhang, Nenghai Yu
Title: MES-RAG: Bringing Multi-modal, Entity-Storage, and Secure Enhancements to RAG
Abstract:
Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by using external knowledge, but it struggles with precise entity information retrieval. In this paper, we propose the MES-RAG framework, which enhances entity-specific query handling and provides accurate, secure, and consistent responses. MES-RAG introduces proactive security measures that ensure system integrity by applying protections prior to data access. Additionally, the system supports real-time multi-modal outputs, including text, images, audio, and video, seamlessly integrating into existing RAG architectures. Experimental results demonstrate that MES-RAG significantly improves both accuracy and recall, highlighting its effectiveness in advancing the security and utility of question-answering, increasing accuracy to 0.83 (+0.25) on the targeted task. Our code and data are available at https://github.com/wpydcr/MES-RAG.
中文:MES-RAG框架通过增强实体查询处理能力、采用主动安全措施及支持实时多模态输出,显著提升了检索增强生成系统的准确性和召回率,有效推进问答系统的安全性与实用性。
English: The MES-RAG framework enhances Retrieval-Augmented Generation by improving entity-specific query handling with proactive security measures and real-time multi-modal outputs, significantly boosting accuracy and recall in question-answering systems.

Authors:Hao Cui, Zahra Shamsi, Gowoon Cheon, Xuejian Ma, Shutong Li, Maria Tikhanovskaya, Peter Norgaard, Nayantara Mudur, Martyna Plomecka, Paul Raccuglia, Yasaman Bahri, Victor V. Albert, Pranesh Srinivasan, Haining Pan, Philippe Faist, Brian Rohr, Ekin Dogus Cubuk, Muratahan Aykol, Amil Merchant, Michael J. Statt, Dan Morris, Drew Purves, Elise Kleeman, Ruth Alcantara, Matthew Abraham, Muqthar Mohammad, Ean Phing VanLee, Chenfei Jiang, Elizabeth Dorfman, Eun-Ah Kim, Michael P Brenner, Viren Jain, Sameera Ponda, Subhashini Venugopalan
Title: CURIE: Evaluating LLMs On Multitask Scientific Long Context Understanding and Reasoning
Abstract:
Scientific problem-solving involves synthesizing information while applying expert knowledge. We introduce CURIE, a scientific long-Context Understanding, Reasoning and Information Extraction benchmark to measure the potential of Large Language Models (LLMs) in scientific problem-solving and assisting scientists in realistic workflows. This benchmark introduces ten challenging tasks with a total of 580 problems and solution pairs curated by experts in six disciplines - materials science, condensed matter physics, quantum computing, geospatial analysis, biodiversity, and proteins - covering both experimental and theoretical workflows in science. We evaluate a range of closed and open LLMs on tasks in CURIE, which require domain expertise, comprehension of long in-context information, and multi-step reasoning. While Gemini Flash 2.0 and Claude-3 show consistent high comprehension across domains, the popular GPT-4o and command-R+ fail dramatically on protein sequencing tasks. With the best performance at 32%, there is much room for improvement for all models. We hope that insights gained from CURIE can guide the future development of LLMs in sciences. Evaluation code and data are available at https://github.com/google/curie.
Chinese: CURIE 是一个科学基准,旨在评估大语言模型在六个学科领域的十项挑战性任务中的理解、推理和信息提取能力,结果显示尽管部分模型表现稳定,但整体仍有巨大提升空间。
English: CURIE is a scientific benchmark designed to evaluate Large Language Models' capabilities in understanding, reasoning, and extracting information across ten challenging tasks in six disciplines, revealing significant room for improvement despite some models showing consistent comprehension.

Authors:Jia Xu, Tianyi Wei, Bojian Hou, Patryk Orzechowski, Shu Yang, Ruochen Jin, Rachael Paulbeck, Joost Wagenaar, George Demiris, Li Shen
Title: MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance
Abstract:
We introduce MentalChat16K, an English benchmark dataset combining a synthetic mental health counseling dataset and a dataset of anonymized transcripts from interventions between Behavioral Health Coaches and Caregivers of patients in palliative or hospice care. Covering a diverse range of conditions like depression, anxiety, and grief, this curated dataset is designed to facilitate the development and evaluation of large language models for conversational mental health assistance. By providing a high-quality resource tailored to this critical domain, MentalChat16K aims to advance research on empathetic, personalized AI solutions to improve access to mental health support services. The dataset prioritizes patient privacy, ethical considerations, and responsible data usage. MentalChat16K presents a valuable opportunity for the research community to innovate AI technologies that can positively impact mental well-being. The dataset is available at https://huggingface.co/datasets/ShenLab/MentalChat16K and the code and documentation are hosted on GitHub at https://github.com/ChiaPatricia/MentalChat16K.
中文: MentalChat16K是一个结合合成与匿名心理健康咨询记录的英文数据集,旨在推动共情式对话助手的AI技术发展,同时严格保障数据隐私与伦理规范。
English: MentalChat16K is a specialized English dataset combining synthetic and anonymized mental health counseling transcripts, designed to advance AI development for empathetic conversational assistance while ensuring privacy and ethical data use.
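
Since the dataset is hosted on the Hugging Face Hub, loading it takes a few lines; a minimal sketch follows, where the split and column names are assumptions to be checked against the dataset card.

```python
# Load MentalChat16K from the Hugging Face Hub (split/column names assumed).
from datasets import load_dataset

ds = load_dataset("ShenLab/MentalChat16K")
print(ds)                  # inspect which splits and columns actually exist
example = ds["train"][0]   # assumes a "train" split
print(example)
```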

Authors:Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick M. Blies, Günter Klambauer, Sebastian Böck, Sepp Hochreiter
Title: xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference
Abstract:
Recent breakthroughs in solving reasoning, math and coding problems with Large Language Models (LLMs) have been enabled by investing substantial computation budgets at inference time. Therefore, inference speed is one of the most critical properties of LLM architectures, and there is a growing need for LLMs that are efficient and fast at inference. Recently, LLMs built on the xLSTM architecture have emerged as a powerful alternative to Transformers, offering linear compute scaling with sequence length and constant memory usage, both highly desirable properties for efficient inference. However, such xLSTM-based LLMs have yet to be scaled to larger models and assessed and compared with respect to inference speed and efficiency. In this work, we introduce xLSTM 7B, a 7-billion-parameter LLM that combines xLSTM's architectural benefits with targeted optimizations for fast and efficient inference. Our experiments demonstrate that xLSTM 7B achieves performance on downstream tasks comparable to other similar-sized LLMs, while providing significantly faster inference speeds and greater efficiency compared to Llama- and Mamba-based LLMs. These results establish xLSTM 7B as the fastest and most efficient 7B LLM, offering a solution for tasks that require large amounts of test-time computation. Our work highlights xLSTM's potential as a foundational architecture for methods building on heavy use of LLM inference. Our model weights, model code and training code are open-source.
中文: 基于xLSTM架构的最新突破催生了xLSTM 7B模型,这个70亿参数的模型在保持优异任务性能的同时,实现了比同类模型更快的推理速度和更高的效率。
English: Recent advances in xLSTM architecture enable the development of xLSTM 7B, a 7-billion-parameter model that outperforms similar-sized LLMs in inference speed and efficiency while maintaining competitive task performance.

Authors:Dengyun Peng, Yuhang Zhou, Qiguang Chen, Jinhao Liu, Jingjing Chen, Libo Qin
Title: DLPO: Towards a Robust, Efficient, and Generalizable Prompt Optimization Framework from a Deep-Learning Perspective
Abstract:
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, largely driven by well-designed prompts. However, crafting and selecting such prompts often requires considerable human effort, significantly limiting its scalability. To mitigate this, recent studies have explored automated prompt optimization as a promising solution. Despite these efforts, existing methods still face critical challenges in robustness, efficiency, and generalization. To systematically address these challenges, we first conduct an empirical analysis to identify the limitations of current reflection-based prompt optimization paradigm. Building on these insights, we propose 7 innovative approaches inspired by traditional deep learning paradigms for prompt optimization (DLPO), seamlessly integrating these concepts into text-based gradient optimization. Through these advancements, we progressively tackle the aforementioned challenges and validate our methods through extensive experimentation. We hope our study not only provides valuable guidance for future research but also offers a comprehensive understanding of the challenges and potential solutions in prompt optimization. Our code is available at https://github.com/sfasfaffa/DLPO.
中文摘要:本研究受深度学习范式启发提出七种创新方法,通过经验分析和广泛实验实现大语言模型的自动提示优化,有效解决了鲁棒性、效率与泛化能力等关键挑战。
English Summary: This study introduces seven innovative methods inspired by deep learning paradigms to automate prompt optimization for LLMs, addressing key challenges in robustness, efficiency, and generalization through empirical analysis and extensive experimentation.
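
To make the text-based gradient idea concrete, here is a minimal sketch of a reflection-style prompt optimization loop: evaluate the prompt, ask the model to critique its failures (the "textual gradient"), then apply the critique as an update. `llm` is an assumed text-in/text-out helper, and the loop is a generic illustration rather than the DLPO implementation.

```python
# Reflection-based prompt optimization sketch (generic, not the paper's code).
def optimize_prompt(llm, prompt, train_batch, steps=5):
    for _ in range(steps):
        # 1. Evaluate the current prompt and collect failure cases.
        failures = [(x, y) for x, y in train_batch
                    if llm(prompt + "\n" + x).strip() != y]
        if not failures:
            break
        # 2. Ask the model to critique the prompt: a "textual gradient".
        critique = llm(
            "The prompt below failed on these examples.\n"
            f"Prompt: {prompt}\nFailures: {failures[:3]}\n"
            "Explain what is wrong with the prompt."
        )
        # 3. Apply the critique to produce an updated prompt.
        prompt = llm(
            "Rewrite the prompt to address the critique.\n"
            f"Prompt: {prompt}\nCritique: {critique}\nNew prompt:"
        )
    return prompt
```

The paper's deep-learning-inspired techniques are integrated into text-based gradient optimization of this general shape.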

Authors:James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Title: MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research
Abstract:
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology. Despite recent advances in multimodal large language models (MLLMs) for AI-assisted research, existing multimodal reasoning benchmarks only target up to college-level difficulty, while research-level benchmarks emphasize lower-level perception, falling short of the complex multimodal reasoning needed for scientific discovery. To bridge this gap, we introduce MicroVQA, a visual-question answering (VQA) benchmark designed to assess three reasoning capabilities vital in research workflows: expert image understanding, hypothesis generation, and experiment proposal. MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities, ensuring VQA samples represent real scientific practice. In constructing the benchmark, we find that standard MCQ generation methods induce language shortcuts, motivating a new two-stage pipeline: an optimized LLM prompt structures question-answer pairs into MCQs; then, an agent-based 'RefineBot' updates them to remove shortcuts. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%; models with smaller LLMs only slightly underperform top models, suggesting that language-based reasoning is less challenging than multimodal reasoning; and tuning with scientific articles enhances performance. Expert analysis of chain-of-thought responses shows that perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors. These insights highlight the challenges in multimodal scientific reasoning, showing MicroVQA is a valuable resource advancing AI-driven biomedical research. MicroVQA is available at https://huggingface.co/datasets/jmhb/microvqa, and project page at https://jmhb0.github.io/microvqa.
中文: MicroVQA是一个专为生物学研究设计的视觉问答基准,用于评估多模态推理能力,填补了现有基准的不足,并通过测试先进模型揭示了感知错误是科学推理中的主要挑战。
English: MicroVQA is a research-level visual question answering benchmark developed to evaluate critical multimodal reasoning skills in biology, addressing gaps in existing benchmarks and revealing key challenges through testing state-of-the-art models.

Authors:Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, Xiangnan Fang, Zewen He, Zhenbo Luo, Wenxuan Wang, Junqi Lin, Jian Luan, Qin Jin
Title: Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Abstract:
Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.
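中文: Time-R1提出基于可验证奖励强化学习的时序视频定位后训练框架,结合数据高效的TimeRFT训练策略与TVGBench评测基准,仅用2.5K训练数据即取得最先进性能。
English: Time-R1 is a reinforcement-learning post-training framework with verifiable rewards for temporal video grounding; combined with the data-efficient TimeRFT strategy and the TVGBench benchmark, it achieves state-of-the-art results with only 2.5K training samples.

A reward for this task is "verifiable" because it can be computed directly from the predicted segment, for example as temporal IoU against the ground-truth segment. The sketch below shows one plausible form of such a reward; the exact shaping used by Time-R1 may differ.

```python
# Temporal-IoU-based verifiable reward (illustrative shaping, not the paper's).
def temporal_iou(pred, gt):
    """pred and gt are (start, end) timestamps in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def grounding_reward(pred, gt, threshold=0.5):
    iou = temporal_iou(pred, gt)
    return iou if iou >= threshold else 0.0  # assumed thresholded reward

print(grounding_reward((12.0, 25.0), (10.0, 24.0)))  # 0.8
```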

Authors:Ying Jiao, Luc De Raedt, Giuseppe Marra
Title: Valid Text-to-SQL Generation with Unification-based DeepStochLog
Abstract:
Large language models have been used to translate natural language questions to SQL queries. Without hard constraints on syntax and database schema, they occasionally produce invalid queries that are not executable. These failures limit the usage of these systems in real-life scenarios. We propose a neurosymbolic framework that imposes SQL syntax and schema constraints with unification-based definite clause grammars and thus guarantees the generation of valid queries. Our framework also builds a bi-directional interface to language models to leverage their natural language understanding abilities. The evaluation results on a subset of SQL grammars show that all our output queries are valid. This work is the first step towards extending language models with unification-based grammars. We demonstrate this extension enhances the validity, execution accuracy, and ground truth alignment of the underlying language model by a large margin. Our code is available at https://github.com/ML-KULeuven/deepstochlog-lm.
中文摘要:本文提出了一种神经符号框架,通过基于合一的定子句语法施加SQL语法和数据库模式约束,确保生成的所有查询均有效,大幅提升了语言模型在自然语言转SQL任务中的表现。
English Summary: This paper introduces a neurosymbolic framework that enforces SQL syntax and schema constraints using unification-based grammars to ensure all generated queries are valid, significantly improving the language model's performance in natural language to SQL translation.

Authors:Fangzhi Xu, Hang Yan, Chang Ma, Haiteng Zhao, Jun Liu, Qika Lin, Zhiyong Wu
Title: φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation
Abstract:
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. While previous search-based strategies address the short-sightedness of auto-regressive generation, the vast search space leads to excessive exploration and insufficient exploitation. To strike an efficient balance and derive the optimal step, we frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain a globally optimal step estimation. Building on this, we propose a novel decoding strategy named φ-Decoding. To provide a precise and expressive estimation of step value, φ-Decoding approximates two distributions via foresight and clustering. Sampling from the joint distribution, the optimal steps can be selected for exploitation. To support adaptive computation allocation, we propose in-width and in-depth pruning strategies, featuring a lightweight solution for inference efficiency. Extensive experiments across seven benchmarks show φ-Decoding outperforms strong baselines in both performance and efficiency. Additional analysis demonstrates its generalization across various LLMs and scalability across a wide range of computing budgets. The code will be released at https://github.com/xufangzhi/phi-Decoding, and the open-source PyPI package is coming soon.
中文:提出的φ-解码策略通过前瞻采样和聚类优化推理步骤,在多个基准测试中实现了卓越的性能与效率,并通过剪枝技术支持自适应计算分配。
English: The proposed φ-Decoding strategy uses foresight sampling and clustering to optimize reasoning steps, achieving superior performance and efficiency across benchmarks while supporting adaptive computation through pruning techniques.
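
A minimal sketch of the foresight-sampling idea: candidate steps are scored by short rollouts, clustered in an embedding space, and the step jointly favored by rollout value and cluster consensus is selected. `propose_steps`, `rollout_score`, and `embed` are assumed helpers, and combining the two distributions by a simple product is a simplification of the paper's formulation.

```python
# Foresight sampling sketch (helper functions and weighting scheme assumed).
import numpy as np
from sklearn.cluster import KMeans

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def foresight_select(propose_steps, rollout_score, embed, state, k=8, n_clusters=3):
    steps = propose_steps(state, k)                               # k candidates
    values = np.array([rollout_score(state, s) for s in steps])   # foresight values
    X = np.stack([embed(s) for s in steps])
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    # Cluster consensus: candidates in larger clusters get more weight.
    sizes = np.array([(labels == labels[i]).sum() for i in range(k)])
    p = softmax(values) * (sizes / k)          # joint (simplified) distribution
    p = p / p.sum()
    return steps[int(np.argmax(p))]            # exploit the best-scoring step
```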

Authors:Chi Han, Xin Liu, Haodong Wang, Shiyang Li, Jingfeng Yang, Haoming Jiang, Zhengyang Wang, Qingyu Yin, Liang Qiu, Changlong Yu, Yifan Gao, Zheng Li, Bing Yin, Jingbo Shang, Heng Ji
Title: Can Language Models Follow Multiple Turns of Entangled Instructions?
Abstract:
Despite significant achievements in improving the instruction-following capabilities of large language models (LLMs), the ability to process multiple potentially entangled or conflicting instructions remains a considerable challenge. Real-world scenarios often require consistency across multiple instructions over time, such as secret privacy, personal preferences, and prioritization, which demand sophisticated abilities to integrate multiple turns and carefully balance competing objectives when instructions intersect or conflict. This work presents a systematic investigation of LLMs' capabilities in handling multiple turns of instructions, covering three levels of difficulty: (1) retrieving information from instructions, (2) tracking and reasoning across turns, and (3) resolving conflicts among instructions. We construct MultiTurnInstruct with ~1.1K high-quality multi-turn conversations through a human-in-the-loop approach, resulting in nine capability categories, including statics and dynamics, reasoning, and multitasking. Our findings reveal an intriguing trade-off between different capabilities. While GPT models demonstrate superior memorization, they show reduced effectiveness in privacy-protection tasks requiring selective information withholding. Larger models exhibit stronger reasoning capabilities but still struggle with resolving conflicting instructions. Importantly, these performance gaps cannot be attributed solely to information loss, as models demonstrate strong BLEU scores on memorization tasks. Still, their attention mechanisms fail to integrate multiple related instructions effectively. These findings highlight critical areas for improvement in complex real-world tasks involving multi-turn instructions. Data and codes are released at https://github.com/Glaciohound/Multi-Turn-Instruct.
中文: 本研究系统评估了大语言模型处理多轮指令的能力,发现模型存在能力权衡——虽在记忆方面表现优异,但即使大型模型具备更强推理能力,仍在冲突解决和隐私保护任务上存在明显不足。
English: This study systematically evaluates large language models' ability to handle multi-turn instructions, revealing a trade-off where models excel at memorization but struggle with conflict resolution and privacy protection despite strong reasoning capabilities in larger models.

Authors:Haiyang Guo, Fanhu Zeng, Ziwei Xiang, Fei Zhu, Da-Han Wang, Xu-Yao Zhang, Cheng-Lin Liu
Title: HiDe-LLaVA: Hierarchical Decoupling for Continual Instruction Tuning of Multimodal Large Language Model
Abstract:
Instruction tuning is widely used to improve a pre-trained Multimodal Large Language Model (MLLM) by training it on curated task-specific datasets, enabling better comprehension of human instructions. However, it is infeasible to collect all possible instruction datasets simultaneously in real-world scenarios. Thus, enabling MLLM with continual instruction tuning is essential for maintaining their adaptability. However, existing methods often trade off memory efficiency for performance gains, significantly compromising overall efficiency. In this paper, we propose a task-specific expansion and task-general fusion framework based on the variations in Centered Kernel Alignment (CKA) similarity across different model layers when trained on diverse datasets. Furthermore, we analyze the information leakage present in the existing benchmark and propose a new and more challenging benchmark to rationally evaluate the performance of different methods. Comprehensive experiments showcase a significant performance improvement of our method compared to existing state-of-the-art methods. Code and dataset are released at https://github.com/Ghy0501/HiDe-LLaVA.
中文: 本文提出了一种基于层间相似度变化的多模态大语言模型持续指令调优框架,通过任务特定扩展与任务通用融合提升模型适应性,并构建了更严谨的基准测试以解决现有评估中的信息泄露问题。
English: This paper introduces a framework for continual instruction tuning of Multimodal Large Language Models that enhances adaptability by balancing task-specific expansion and task-general fusion, while also proposing a more challenging benchmark to address information leakage in existing evaluations.
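
For reference, the linear form of Centered Kernel Alignment, the layer-similarity measure the framework builds on, can be computed in a few lines. This is the standard formulation, not code from the paper's repository.

```python
# Linear CKA between two activation matrices of shape (n_samples, dim).
import numpy as np

def linear_cka(X, Y):
    X = X - X.mean(axis=0, keepdims=True)  # center features
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return hsic / norm

rng = np.random.default_rng(0)
A = rng.normal(size=(128, 64))
print(linear_cka(A, A))                            # 1.0 for identical reps
print(linear_cka(A, rng.normal(size=(128, 64))))   # near 0 for unrelated reps
```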

Authors:Duke Nguyen, Aditya Joshi, Flora Salim
Title: Harnessing Test-time Adaptation for NLU tasks Involving Dialects of English
Abstract:
Test-time adaptation (TTA) is an effective method for generalizing models across domains, tasks, and distributions without the use of labeled datasets. TTA is therefore very useful for natural language processing (NLP) in the dialectal setting, since models are often trained on Standard American English (SAE) but evaluated on Indian English or Nigerian English, whose distributions differ significantly from the former. This is especially useful since dialectal datasets are scarce. In this paper, we explore one of the most widely used TTA techniques, SHOT, in dialectal NLP. We finetune and evaluate SHOT on different combinations of dialectal GLUE. Our findings show that SHOT is a viable technique when labeled datasets are unavailable. We also theoretically propose the concept of the dialectal gap and show that it has a positive correlation with the effectiveness of SHOT. We also find that in many cases, finetuning on SAE yields higher performance than finetuning on dialectal data. Our code is available at https://github.com/dukenguyenxyz/dialect-adaptation.
中文摘要:测试时适配(TTA)可在无标注数据情况下有效泛化模型至不同方言分布,其中SHOT方法被证明可行且其效果与方言差异正相关,而基于标准美式英语的微调常优于方言数据。
English Summary: Test-time adaptation (TTA) effectively generalizes models across dialectal distributions without labeled data, with SHOT proving viable and its effectiveness positively correlated with dialectal gaps, while SAE-based finetuning often outperforms dialectal data.
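
SHOT freezes the source classifier head and adapts only the feature extractor on unlabeled target (here, dialectal) data, driven by an information-maximization objective: predictions should be confident per example yet diverse across the batch. A minimal sketch of that objective, omitting SHOT's optional pseudo-labeling term, might look like this:

```python
# SHOT-style information-maximization loss (pseudo-labeling term omitted).
import torch
import torch.nn.functional as F

def shot_im_loss(logits, eps=1e-6):
    p = F.softmax(logits, dim=1)
    ent = -(p * torch.log(p + eps)).sum(dim=1).mean()  # per-example confidence
    p_mean = p.mean(dim=0)
    div = -(p_mean * torch.log(p_mean + eps)).sum()    # batch-level diversity
    return ent - div  # minimize entropy, maximize diversity

# Usage sketch: freeze the classifier head, update only encoder parameters.
# logits = model(batch)            # unlabeled dialectal text
# shot_im_loss(logits).backward(); optimizer.step()
```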

Authors:Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Xiaoyi Feng, Maosong Sun
Title: DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs for Knowledge-Intensive Visual Grounding
Abstract:
Human experts excel at fine-grained visual discrimination by leveraging domain knowledge to refine perceptual features, a capability that remains underdeveloped in current Multimodal Large Language Models (MLLMs). Despite possessing vast expert-level knowledge, MLLMs struggle to integrate reasoning into visual perception, often generating direct responses without deeper analysis. To bridge this gap, we introduce knowledge-intensive visual grounding (KVG), a novel visual grounding task that requires both fine-grained perception and domain-specific knowledge integration. To address the challenges of KVG, we propose DeepPerception, an MLLM enhanced with cognitive visual perception capabilities. Our approach consists of (1) an automated data synthesis pipeline that generates high-quality, knowledge-aligned training samples, and (2) a two-stage training framework combining supervised fine-tuning for cognitive reasoning scaffolding and reinforcement learning to optimize perception-cognition synergy. To benchmark performance, we introduce KVG-Bench, a comprehensive dataset spanning 10 domains with 1.3K manually curated test cases. Experimental results demonstrate that DeepPerception significantly outperforms direct fine-tuning, achieving +8.08% accuracy improvements on KVG-Bench and exhibiting +4.60% superior cross-domain generalization over baseline approaches. Our findings highlight the importance of integrating cognitive processes into MLLMs for human-like visual perception and open new directions for multimodal reasoning research. The data, codes, and models are released at https://github.com/thunlp/DeepPerception.
Chinese Summary: 本文提出DeepPerception模型,通过知识密集型视觉定位任务,结合认知推理框架与强化学习,显著提升了多模态大语言模型在细粒度视觉感知与领域知识融合方面的性能。
English Summary: This paper introduces DeepPerception, a Multimodal Large Language Model enhanced with cognitive visual perception to bridge the gap between expert knowledge and fine-grained visual discrimination through knowledge-intensive visual grounding.

Authors:Jacob Chmura, Jonah Dauvet, Sebastian Sabry
Title: Plausibility Vaccine: Injecting LLM Knowledge for Event Plausibility
Abstract:
Despite advances in language modelling, distributional methods that build semantic representations from co-occurrences fail to discriminate between plausible and implausible events. In this work, we investigate how plausibility prediction can be improved by injecting latent knowledge prompted from large language models using parameter-efficient fine-tuning. We train 12 task adapters to learn various physical properties and association measures and perform adapter fusion to compose latent semantic knowledge from each task on top of pre-trained ALBERT embeddings. We automate auxiliary task data generation, which enables us to scale our approach and fine-tune our learned representations across two plausibility datasets. Our code is available at https://github.com/Jacob-Chmura/plausibility-vaccine.
中文: 本研究通过参数高效微调将大语言模型的潜在知识注入,在预训练的ALBERT嵌入上使用适配器融合技术,并自动生成辅助任务数据,从而提升了两个合理性数据集中的事件合理性预测能力。
English: This study enhances plausibility prediction by integrating latent knowledge from large language models through parameter-efficient fine-tuning, using adapter fusion on pre-trained ALBERT embeddings and automating auxiliary task data generation across two datasets.

Authors:Imran Kabir, Md Alimoor Reza, Syed Billah
Title: Logic-RAG: Augmenting Large Multimodal Models with Visual-Spatial Knowledge for Road Scene Understanding
Abstract:
Large multimodal models (LMMs) are increasingly integrated into autonomous driving systems for user interaction. However, their limitations in fine-grained spatial reasoning pose challenges for system interpretability and user trust. We introduce Logic-RAG, a novel Retrieval-Augmented Generation (RAG) framework that improves LMMs' spatial understanding in driving scenarios. Logic-RAG constructs a dynamic knowledge base (KB) about object-object relationships in first-order logic (FOL) using a perception module, a query-to-logic embedder, and a logical inference engine. We evaluated Logic-RAG on visual-spatial queries using both synthetic and real-world driving videos. When using popular LMMs (GPT-4V, Claude 3.5) as proxies for an autonomous driving system, these models achieved only 55% accuracy on synthetic driving scenes and under 75% on real-world driving scenes. Augmenting them with Logic-RAG increased their accuracies to over 80% and 90%, respectively. An ablation study showed that even without logical inference, the fact-based context constructed by Logic-RAG alone improved accuracy by 15%. Logic-RAG is extensible: it allows seamless replacement of individual components with improved versions and enables domain experts to compose new knowledge in both FOL and natural language. In sum, Logic-RAG addresses critical spatial reasoning deficiencies in LMMs for autonomous driving applications. Code and data are available at https://github.com/Imran2205/LogicRAG.
Chinese: Logic-RAG是一种新颖的检索增强生成框架,通过一阶逻辑构建动态知识库,显著提升多模态大模型在自动驾驶中的空间推理能力,有效解决了现有模型在细粒度空间理解上的不足。
English: Logic-RAG is a novel Retrieval-Augmented Generation framework that enhances large multimodal models' spatial reasoning in autonomous driving by constructing a dynamic knowledge base with first-order logic, significantly improving accuracy on visual-spatial queries.
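
The knowledge-base idea can be illustrated with a toy example: perception emits grounded facts about scene objects, and simple logical rules answer spatial queries over them. The predicates and the single rule below are illustrative stand-ins for the system's actual FOL schema.

```python
# Toy first-order-logic KB for spatial queries (illustrative schema).
facts = {
    ("left_of", "car_1", "truck_2"),
    ("ahead_of", "truck_2", "ego"),
    ("ahead_of", "car_1", "ego"),
}

def holds(pred, a, b):
    if (pred, a, b) in facts:
        return True
    # Example rule: right_of(X, Y) :- left_of(Y, X).
    if pred == "right_of":
        return ("left_of", b, a) in facts
    return False

print(holds("right_of", "truck_2", "car_1"))  # True, derived by the rule
```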

Authors:Zhiwei He, Zhaopeng Tu, Xing Wang, Xingyu Chen, Zhijie Wang, Jiahao Xu, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Rui Wang
Title: RaSA: Rank-Sharing Low-Rank Adaptation
Abstract:
Low-rank adaptation (LoRA) has been prominently employed for parameter-efficient fine-tuning of large language models (LLMs). However, the limited expressive capacity of LoRA, stemming from the low-rank constraint, has been recognized as a bottleneck, particularly in rigorous tasks like code generation and mathematical reasoning. To address this limitation, we introduce Rank-Sharing Low-Rank Adaptation (RaSA), an innovative extension that enhances the expressive capacity of LoRA by leveraging partial rank sharing across layers. By forming a shared rank pool and applying layer-specific weighting, RaSA effectively increases the number of ranks without augmenting parameter overhead. Our theoretically grounded and empirically validated approach demonstrates that RaSA not only maintains the core advantages of LoRA but also significantly boosts performance in challenging code and math tasks. Code, data and scripts are available at: https://github.com/zwhe99/RaSA.
中文: RaSA通过跨层部分秩共享增强了LoRA的表达能力,在不增加参数的情况下有效提升秩数,显著提高了代码生成和数学推理等复杂任务的性能。
English: RaSA enhances LoRA's expressive capacity by implementing partial rank sharing across layers, effectively increasing ranks without additional parameters and significantly improving performance in demanding tasks like code generation and mathematical reasoning.
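
A conceptual sketch of rank sharing: each adapted layer combines a private low-rank update with a globally shared rank pool, modulated by a learned layer-specific weighting. Dimensions, initialization, and the exact composition are assumptions, not the released implementation.

```python
# RaSA-style shared-rank adapter sketch (composition details assumed).
import torch
import torch.nn as nn

class RaSALinear(nn.Module):
    """Private LoRA ranks plus a weighted, globally shared rank pool."""
    def __init__(self, base: nn.Linear, shared_A: nn.Parameter,
                 shared_B: nn.Parameter, r_private: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)  # frozen pretrained layer
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r_private, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r_private))
        self.shared_A = shared_A                # (r_shared, d_in), shared pool
        self.shared_B = shared_B                # (d_out, r_shared)
        self.gate = nn.Parameter(torch.zeros(shared_A.shape[0]))  # per-layer weights

    def forward(self, x):
        out = self.base(x)
        out = out + x @ self.A.T @ self.B.T                              # private
        out = out + (x @ self.shared_A.T * self.gate) @ self.shared_B.T  # shared
        return out

# The rank pool is created once and passed to every adapted layer.
d = 1024
shared_A = nn.Parameter(torch.randn(8, d) * 0.01)
shared_B = nn.Parameter(torch.zeros(d, 8))
layer = RaSALinear(nn.Linear(d, d), shared_A, shared_B)
print(layer(torch.randn(2, d)).shape)  # torch.Size([2, 1024])
```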

Authors:Xiao Wang, Qingyi Si, Jianlong Wu, Shiyu Zhu, Li Cao, Liqiang Nie
Title: AdaReTaKe: Adaptive Redundancy Reduction to Perceive Longer for Video-language Understanding
Abstract:
Multimodal Large Language Models (MLLMs) have revolutionized video understanding, yet are still limited by context length when processing long videos. Recent methods compress videos by leveraging visual redundancy uniformly, yielding promising results. Nevertheless, our quantitative analysis shows that redundancy varies significantly across time and model layers, necessitating a more flexible compression strategy. We propose AdaReTaKe, a training-free method that flexibly reduces visual redundancy by allocating compression ratios among time and layers with theoretical guarantees. Integrated into state-of-the-art MLLMs, AdaReTaKe improves processing capacity from 256 to 2048 frames while preserving critical information. Experiments on VideoMME, MLVU, LongVideoBench, and LVBench datasets demonstrate that AdaReTaKe outperforms existing methods by 2.3% and 2.8% for 7B and 72B models, respectively, with even greater improvements of 5.9% and 6.0% on the longest LVBench. Our code is available at https://github.com/SCZwangxiao/video-FlexReduc.git.
Chinese: AdaReTaKe是一种无需训练的方法,通过自适应地减少时间和模型层间的视觉冗余,使多模态大语言模型能够处理多达2048帧的视频,并在基准数据集上以高达6.0%的优势超越现有方法。
English: AdaReTaKe is a training-free method that adaptively reduces visual redundancy across time and model layers, enabling multimodal large language models to process up to 2048 frames while outperforming existing methods by up to 6.0% on benchmark datasets.

Authors:Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li
Title: CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences
Abstract:
Large language models (LLMs) excel at processing long sequences, boosting demand for key-value (KV) caching. While recent efforts to evict KV cache have alleviated the inference burden, they often fail to allocate resources rationally across layers with different attention patterns. In this paper, we introduce Cascading and Adaptive KV cache Eviction (CAKE), a novel approach that frames KV cache eviction as a "cake-slicing problem." CAKE assesses layer-specific preferences by considering attention dynamics in both spatial and temporal dimensions, allocates rational cache size for layers accordingly, and manages memory constraints in a cascading manner. This approach enables a global view of cache allocation, adaptively distributing resources across diverse attention mechanisms while maintaining memory budgets. CAKE also employs a new eviction indicator that considers the shifting importance of tokens over time, addressing limitations in existing methods that overlook temporal dynamics. Comprehensive experiments on LongBench and NeedleBench show that CAKE maintains model performance with only 3.2% of the KV cache and consistently outperforms current baselines across various models and memory constraints, particularly in low-memory settings. Additionally, CAKE achieves over 10x speedup in decoding latency compared to full cache when processing contexts of 128K tokens with FlashAttention-2. Our code is available at https://github.com/antgroup/cakekv.
中文摘要:本文提出CAKE方法,通过空间和时间维度的注意力动态分析,以层间级联方式自适应分配KV缓存资源,在极低缓存条件下保持模型性能,并实现解码延迟的数量级提升。
English Summary: The paper introduces CAKE, an adaptive KV cache eviction method that optimizes memory allocation across layers by considering spatial and temporal attention dynamics, achieving high performance with minimal cache while significantly reducing decoding latency.
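
The "cake-slicing" step can be sketched in a few lines: a global KV budget is split across layers in proportion to a per-layer preference score, and each layer then evicts its lowest-scoring cached tokens. The scores below are placeholders for CAKE's spatiotemporal attention indicators.

```python
# Cascading budget allocation and eviction sketch (scores are placeholders).
import numpy as np

def allocate_budget(layer_scores, total_budget):
    w = np.asarray(layer_scores, dtype=float)
    w = w / w.sum()
    return np.maximum(1, (w * total_budget).astype(int))  # tokens kept per layer

def evict(token_scores, keep):
    order = np.argsort(token_scores)[::-1]  # most important tokens first
    return np.sort(order[:keep])            # indices of retained KV entries

layer_scores = [0.9, 0.4, 0.2, 0.5]         # e.g., attention dispersion per layer
budgets = allocate_budget(layer_scores, total_budget=1024)
print(budgets)  # layers with broader attention receive larger slices
```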

Authors:Tsz Chung Cheng, Chung Shing Cheng, Chaak Ming Lau, Eugene Tin-Ho Lam, Chun Yat Wong, Hoi On Yu, Cheuk Hei Chong
Title: HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs
Abstract:
The ability of language models to comprehend and interact in diverse linguistic and cultural landscapes is crucial. The Cantonese language used in Hong Kong presents unique challenges for natural language processing due to its rich cultural nuances and lack of dedicated evaluation datasets. The HKCanto-Eval benchmark addresses this gap by evaluating the performance of large language models (LLMs) on Cantonese language understanding tasks, extending to English and Written Chinese for cross-lingual evaluation. HKCanto-Eval integrates cultural and linguistic nuances intrinsic to Hong Kong, providing a robust framework for assessing language models in realistic scenarios. Additionally, the benchmark includes questions designed to tap into the underlying linguistic metaknowledge of the models. Our findings indicate that while proprietary models generally outperform open-weight models, significant limitations remain in handling Cantonese-specific linguistic and cultural knowledge, highlighting the need for more targeted training data and evaluation methods. The code can be accessed at https://github.com/hon9kon9ize/hkeval2025
中文摘要:HKCanto-Eval基准通过融入香港文化特色评估大语言模型的粤语理解能力,研究发现专有模型虽优于开源模型,但在处理粤语特定知识方面仍存在明显不足。
English Summary: The HKCanto-Eval benchmark evaluates large language models' performance on Cantonese language understanding with Hong Kong cultural nuances, revealing that proprietary models outperform open-weight ones but still struggle with Cantonese-specific knowledge.

Authors:Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
Title: Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs
Abstract:
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as a data generator is effective, but is impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts and use private information ineffectively in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
中文:提出的CTCL框架通过结合差分隐私微调的轻量生成器和基于聚类的主题模型,有效生成隐私保护合成数据,无需大量计算或人工提示即可克服现有方法的局限。
English: The proposed CTCL framework generates privacy-preserving synthetic data by combining a differentially private fine-tuned lightweight generator with a clustering-based topic model, effectively overcoming limitations of existing methods without requiring extensive computation or manual prompts.
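
The distributional half of the pipeline can be illustrated with the standard Laplace mechanism: per-topic counts over private data are noised to produce a DP histogram, which then dictates how many examples the conditional generator synthesizes per topic. The epsilon and counts below are illustrative.

```python
# DP topic histogram via the Laplace mechanism (illustrative parameters).
import numpy as np

def dp_histogram(counts, epsilon=1.0, seed=0):
    rng = np.random.default_rng(seed)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
    noisy = np.clip(noisy, 0.0, None)       # counts cannot be negative
    return noisy / noisy.sum()              # privatized topic distribution

private_counts = np.array([120, 45, 300, 80])   # per-topic counts (private)
probs = dp_histogram(private_counts)
per_topic = np.random.default_rng(1).multinomial(1000, probs)
print(per_topic)  # how many samples to draw from the generator per topic
```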

Authors:Xin Wang, Samiul Alam, Zhongwei Wan, Hui Shen, Mi Zhang
Title: SVD-LLM V2: Optimizing Singular Value Truncation for Large Language Model Compression
Abstract:
Despite significant advancements, the practical deployment of Large Language Models (LLMs) is often hampered by their immense sizes, highlighting the need for effective compression techniques. Singular Value Decomposition (SVD) is a promising LLM compression technique. However, existing SVD-based compression methods fall short in reducing truncation losses, leading to less competitive performance in compressed models. In this work, we introduce SVD-LLM V2, an SVD-based LLM compression method that optimizes singular value truncation in SVD compression with two techniques. First, SVD-LLM V2 proposes to use the theoretical truncation loss of weight matrices to assign a unique compression ratio to each weight matrix at different layers to accommodate weight redundancy heterogeneity. Second, SVD-LLM V2 proposes loss-optimized weight truncation to ensure that the truncated singular values result in a lower and more stable truncation loss in practice. We evaluate SVD-LLM V2 on ten datasets and five LLMs at various scales. Our results show SVD-LLM V2 outperforms state-of-the-art SVD-based LLM compression methods. Our code is available at https://github.com/AIoT-MLSys-Lab/SVD-LLM.
中文: SVD-LLM V2 通过采用分层压缩比和损失优化截断技术改进SVD压缩,在多个模型与数据集上超越了现有压缩方法。
English: SVD-LLM V2 enhances LLM compression by optimizing singular value truncation with layer-specific ratios and loss-minimizing techniques, outperforming existing methods across multiple models and datasets.
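
The underlying compression step, truncating the SVD of a weight matrix into two thin factors, can be sketched as follows; the fixed ratio here stands in for SVD-LLM V2's per-matrix, loss-driven choice of rank.

```python
# Truncated-SVD weight compression sketch (fixed ratio instead of per-matrix).
import numpy as np

def svd_compress(W, ratio=0.25):
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    k = max(1, int(ratio * len(S)))
    L = U[:, :k] * S[:k]   # (out, k): singular values absorbed into left factor
    R = Vt[:k, :]          # (k, in)
    return L, R

W = np.random.default_rng(0).normal(size=(512, 512))
L, R = svd_compress(W)
err = np.linalg.norm(W - L @ R) / np.linalg.norm(W)
print(f"relative truncation loss: {err:.3f}")
```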

Authors:Cheng Deng, Luoyang Sun, Jiwen Jiang, Yongcheng Zeng, Xinjian Wu, Wenxin Zhao, Qingfa Xiao, Jiachuan Wang, Haoyang Li, Lei Chen, Lionel M. Ni, Haifeng Zhang, Jun Wang
Title: PLM: Efficient Peripheral Language Models Hardware-Co-Designed for Ubiquitous Computing
Abstract:
While scaling laws have been continuously validated in large language models (LLMs) with increasing model parameters, the inherent tension between the inference demands of LLMs and the limited resources of edge devices poses a critical challenge to the development of edge intelligence. Recently, numerous small language models have emerged, aiming to distill the capabilities of LLMs into smaller footprints. However, these models often retain the fundamental architectural principles of their larger counterparts, still imposing considerable strain on the storage and bandwidth capacities of edge devices. In this paper, we introduce the PLM, a Peripheral Language Model, developed through a co-design process that jointly optimizes model architecture and edge system constraints. The PLM utilizes a Multi-head Latent Attention mechanism and employs the squared ReLU activation function to encourage sparsity, thereby reducing peak memory footprint during inference. During training, we collect and reorganize open-source datasets, implement a multi-phase training strategy, and empirically investigate the Warmup-Stable-Decay-Constant (WSDC) learning rate scheduler. Additionally, we incorporate Reinforcement Learning from Human Feedback (RLHF) by adopting the ARIES preference learning approach. Following a two-phase SFT process, this method yields performance gains of 2% in general tasks, 9% in the GSM8K task, and 11% in coding tasks. In addition to its novel architecture, evaluation results demonstrate that PLM outperforms existing small language models trained on publicly available data while maintaining the lowest number of activated parameters. Furthermore, deployment across various edge devices, including consumer-grade GPUs, mobile phones, and Raspberry Pis, validates PLM's suitability for peripheral applications. The PLM series models are publicly available at https://github.com/plm-team/PLM.
中文: PLM是一种与边缘系统协同设计的新型小型语言模型,采用多头潜在注意力机制和平方ReLU激活函数来降低内存占用,在保持最少激活参数的同时,在多项任务上超越了现有模型性能。
English: The PLM is a novel small language model co-designed with edge system constraints, featuring a Multi-head Latent Attention mechanism and squared ReLU activation to reduce memory usage while outperforming existing models on multiple tasks with the fewest activated parameters.
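
The squared ReLU activation mentioned above is simple to state: negative inputs map to exact zeros and small positive activations are further suppressed, which encourages the sparsity the design targets.

```python
# Squared ReLU: relu(x)^2, the sparsity-encouraging activation used by PLM.
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    return torch.relu(x) ** 2

x = torch.tensor([-1.0, 0.0, 0.5, 2.0])
print(squared_relu(x))  # tensor([0.0000, 0.0000, 0.2500, 4.0000])
```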

Authors:Junjie Chen, Xuyang Liu, Subin Huang, Linfeng Zhang, Hang Yu
Title: Seeing Sarcasm Through Different Eyes: Analyzing Multimodal Sarcasm Perception in Large Vision-Language Models
Abstract:
With the advent of large vision-language models (LVLMs) demonstrating increasingly human-like abilities, a pivotal question emerges: do different LVLMs interpret multimodal sarcasm differently, and can a single model grasp sarcasm from multiple perspectives like humans? To explore this, we introduce an analytical framework using systematically designed prompts on existing multimodal sarcasm datasets. Evaluating 12 state-of-the-art LVLMs over 2,409 samples, we examine interpretive variations within and across models, focusing on confidence levels, alignment with dataset labels, and recognition of ambiguous "neutral" cases. We further validate our findings on a diverse 100-sample mini-benchmark, incorporating multiple datasets, expanded prompt variants, and representative commercial LVLMs. Our findings reveal notable discrepancies, both across LVLMs and within the same model under varied prompts. While classification-oriented prompts yield higher internal consistency, models diverge markedly when tasked with interpretive reasoning. These results challenge binary labeling paradigms by highlighting sarcasm's subjectivity. We advocate moving beyond rigid annotation schemes toward multi-perspective, uncertainty-aware modeling, offering deeper insights into multimodal sarcasm comprehension. Our code and data are available at: https://github.com/CoderChen01/LVLMSarcasmAnalysis
Chinese: 本研究探讨不同大型视觉语言模型对多模态讽刺的理解,发现模型之间及同一模型在不同提示下存在显著差异,挑战了二元标注范式,提倡采用多视角建模方法。
English: This study investigates how different large vision-language models interpret multimodal sarcasm, revealing significant variations both across models and within the same model under different prompts, challenging binary labeling and advocating for multi-perspective modeling.

Authors:Tobia Poppi, Tejaswi Kasarla, Pascal Mettes, Lorenzo Baraldi, Rita Cucchiara
Title: Hyperbolic Safety-Aware Vision-Language Models
Abstract:
Addressing the retrieval of unsafe content from vision-language models such as CLIP is an important step towards real-world integration. Current efforts have relied on unlearning techniques that try to erase the model's knowledge of unsafe concepts. While effective in reducing unwanted outputs, unlearning limits the model's capacity to discern between safe and unsafe content. In this work, we introduce a novel approach that shifts from unlearning to an awareness paradigm by leveraging the inherent hierarchical properties of the hyperbolic space. We propose to encode safe and unsafe content as an entailment hierarchy, where both are placed in different regions of hyperbolic space. Our HySAC, Hyperbolic Safety-Aware CLIP, employs entailment loss functions to model the hierarchical and asymmetrical relations between safe and unsafe image-text pairs. This modelling, ineffective in standard vision-language models due to their reliance on Euclidean embeddings, endows the model with awareness of unsafe content, enabling it to serve as both a multimodal unsafe classifier and a flexible content retriever, with the option to dynamically redirect unsafe queries toward safer alternatives or retain the original output. Extensive experiments show that our approach not only enhances safety recognition but also establishes a more adaptable and interpretable framework for content moderation in vision-language models. Our source code is available at https://github.com/aimagelab/HySAC.
中文: 本文提出HySAC模型,通过双曲空间分层编码安全与不安全内容,使视觉语言模型既能有效识别不安全内容,又能灵活控制检索结果,在提升安全性的同时保持模型适应性和可解释性。
English: This paper introduces HySAC, a novel safety-aware CLIP model that uses hyperbolic space to hierarchically encode safe and unsafe content, enabling both effective unsafe content classification and flexible retrieval while maintaining adaptability and interpretability.

Authors:Zhaopeng Feng, Jiahan Ren, Jiayuan Su, Jiamei Zheng, Hongwei Wang, Zuozhu Liu
Title: MT-RewardTree: A Comprehensive Framework for Advancing LLM-Based Machine Translation via Reward Modeling
Abstract:
Process reward models (PRMs) have shown success in complex reasoning tasks for large language models (LLMs). However, their application to machine translation (MT) remains underexplored due to the lack of systematic methodologies and evaluation benchmarks. To address this gap, we introduce MT-RewardTree, a comprehensive framework for constructing, evaluating, and deploying process reward models in MT. Unlike traditional vanilla preference pair construction, we propose a novel method for automatically generating token-level preference pairs using approximate Monte Carlo Tree Search (MCTS), which mitigates the prohibitive cost of human annotation for fine-grained steps. Then, we establish the first MT-specific reward model benchmark and provide a systematic comparison of different reward modeling architectures, revealing that token-level supervision effectively captures fine-grained preferences. Experimental results demonstrate that our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance in both token-level and sequence-level evaluation given the same input prefix. Furthermore, we showcase practical applications where PRMs enable test-time alignment for LLMs without additional alignment training and significantly improve performance in hypothesis ensembling. Our work provides valuable insights into the role of reward models in MT research. Our code and data are released at https://sabijun.github.io/MT_RewardTreePage/.
中文: MT-RewardTree提出了构建、评估和部署机器翻译过程奖励模型的综合框架,利用近似蒙特卡洛树搜索自动生成词元级偏好对,并建立首个面向机器翻译的奖励模型基准。
English: MT-RewardTree is a comprehensive framework for constructing, evaluating, and deploying process reward models in machine translation, using approximate Monte Carlo Tree Search to automatically generate token-level preference pairs and establishing the first MT-specific reward model benchmark.

Authors:Zhenyu Wang
Title: LogitLens4LLMs: Extending Logit Lens Analysis to Modern Large Language Models
Abstract:
This paper introduces LogitLens4LLMs, a toolkit that extends the Logit Lens technique to modern large language models. While Logit Lens has been a crucial method for understanding internal representations of language models, it was previously limited to earlier model architectures. Our work overcomes the limitations of existing implementations, enabling the technique to be applied to state-of-the-art architectures (such as Qwen-2.5 and Llama-3.1) while automating key analytical workflows. By developing component-specific hooks to capture both attention mechanisms and MLP outputs, our implementation achieves full compatibility with the HuggingFace transformer library while maintaining low inference overhead. The toolkit provides both interactive exploration and batch processing capabilities, supporting large-scale layer-wise analyses. Through open-sourcing our implementation, we aim to facilitate deeper investigations into the internal mechanisms of large-scale language models. The toolkit is openly available at https://github.com/zhenyu-02/LogitLens4LLMs.
中文: 本文介绍了LogitLens4LLMs工具包,它将Logit Lens技术扩展到现代大语言模型,支持对Qwen-2.5和Llama-3.1等先进架构的内部机制进行交互式和批量分析。
English: This paper presents LogitLens4LLMs, a toolkit that extends the Logit Lens technique to modern large language models like Qwen-2.5 and Llama-3.1, enabling comprehensive analysis of internal mechanisms with minimal performance impact.
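
The core logit-lens operation decodes each intermediate hidden state through the model's final norm and unembedding matrix to see what the model "currently" predicts at every layer. The sketch below writes this against the HuggingFace API; the checkpoint id is an assumption, and the norm placement follows the Llama layout (it varies by architecture).

```python
# Logit lens sketch: decode every layer's hidden state (Llama-style layout).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-3.1-8B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for i, h in enumerate(out.hidden_states):   # embeddings + one entry per layer
    h_last = model.model.norm(h[:, -1])     # final RMSNorm, Llama-style
    logits = model.lm_head(h_last)
    print(i, tok.decode(logits.argmax(-1)))  # top-1 token at each depth
```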

Authors:Zhiliang Chen, Xinyuan Niu, Chuan-Sheng Foo, Bryan Kian Hsiang Low
Title: Broaden your SCOPE! Efficient Multi-turn Conversation Planning for LLMs with Semantic Space
Abstract:
Large language models (LLMs) are used in chatbots or AI assistants to hold conversations with a human user. In such applications, the quality (e.g., user engagement, safety) of a conversation is important and can only be exactly known at the end of the conversation. To maximize its expected quality, conversation planning reasons about the stochastic transitions within a conversation to select the optimal LLM response at each turn. Existing simulation-based conversation planning algorithms typically select the optimal response by simulating future conversations with a large number of LLM queries at every turn. However, this process is extremely time-consuming and hence impractical for real-time conversations. This paper presents a novel approach called Semantic space COnversation Planning with improved Efficiency (SCOPE) that exploits the dense semantic representation of conversations to perform conversation planning efficiently. In particular, SCOPE models the stochastic transitions in conversation semantics and their associated rewards to plan entirely within the semantic space. This allows us to select the optimal LLM response at every conversation turn without needing additional LLM queries for simulation. As a result, SCOPE can perform conversation planning 70 times faster than conventional simulation-based planning algorithms when applied to a wide variety of conversation starters and two reward functions seen in the real world, yet achieving a higher reward within a practical planning budget. Our code can be found at: https://github.com/chenzhiliang94/convo-plan-SCOPE.
Chinese: SCOPE提出了一种在语义空间内进行高效对话规划的新方法,通过建模对话语义的随机转换,无需额外的大语言模型查询模拟,实现了比传统算法快70倍的速度,同时获得更高的奖励收益。
English: SCOPE introduces a novel method for efficient conversation planning by modeling transitions in semantic space, eliminating the need for time-consuming LLM simulations and achieving 70 times faster performance while improving reward outcomes.

Authors:Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
Title: TikZero: Zero-Shot Text-Guided Graphics Program Synthesis
Abstract:
Automatically synthesizing figures from text captions is a compelling capability. However, achieving high geometric precision and editability requires representing figures as graphics programs in languages like TikZ, and aligned training data (i.e., graphics programs with captions) remains scarce. Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available. We reconcile these disparate data sources by presenting TikZero, which decouples graphics program generation from text understanding by using image representations as an intermediary bridge. It enables independent training on graphics programs and captioned images and allows for zero-shot text-guided graphics program synthesis during inference. We show that our method substantially outperforms baselines that can only operate with caption-aligned graphics programs. Furthermore, when leveraging caption-aligned graphics programs as a complementary training signal, TikZero matches or exceeds the performance of much larger models, including commercial systems like GPT-4o. Our code, datasets, and select models are publicly available.
中文摘要:TikZero通过图像表示将文本理解与图形程序生成解耦,实现了零样本文本到图形程序的合成,其性能超越基线模型,并在补充对齐数据时达到或超过大型商业系统的水平。
English Summary: TikZero enables zero-shot text-to-graphics program synthesis by decoupling text understanding from program generation through image representations, outperforming baseline models and matching larger commercial systems when supplemented with aligned data.

Authors:Balaji Rama, Kai Mei, Yongfeng Zhang
Title: Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery
Abstract:
Autonomous LLM-based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, distribution and discovery of agents. We present Cerebrum, an Agent SDK for AIOS that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four-layer architecture for agent development, encompassing LLM, memory, storage, and tool management; (2) a community-driven Agent Hub for sharing and discovering agents, complete with version control and dependency management; (3) an interactive web interface for testing and evaluating agents. The platform's effectiveness is demonstrated through implementations of various agent architectures, including Chain of Thought (CoT), ReAct, and tool-use agents. Cerebrum advances the field by providing a unified framework that standardizes agent development while maintaining flexibility for researchers and developers to innovate and distribute their agents. The live website is at https://app.aios.foundation, the code is at https://github.com/agiresearch/Cerebrum, and video is at https://app.aios.foundation/video-demo.
中文: Cerebrum作为AIOS的智能体SDK,通过模块化架构、社区平台和交互界面,为自主LLM智能体的开发、共享与评估提供了标准化解决方案,推动了该领域的统一发展。
English: Cerebrum is an Agent SDK for AIOS that standardizes autonomous LLM-based agent development through a modular architecture, community hub, and interactive interface, advancing the field with a unified framework.
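As a rough illustration of the four-layer structure described in the abstract, the sketch below wires stub LLM, memory, storage, and tool layers into a minimal agent. All class and method names here are hypothetical, not the actual Cerebrum API.

```python
from dataclasses import dataclass, field

class LLMLayer:
    def complete(self, prompt: str) -> str:
        return f"[llm response to: {prompt!r}]"   # stub for a real model call

@dataclass
class MemoryLayer:
    turns: list = field(default_factory=list)     # short-term conversation state

@dataclass
class StorageLayer:
    documents: dict = field(default_factory=dict) # long-term persistence

class ToolLayer:
    def __init__(self):
        self.tools = {"echo": lambda x: x}
    def call(self, name: str, arg: str) -> str:
        return self.tools[name](arg)

class Agent:
    """Composes the four layers; a ReAct-style agent would loop here."""
    def __init__(self):
        self.llm, self.memory = LLMLayer(), MemoryLayer()
        self.storage, self.tools = StorageLayer(), ToolLayer()

    def step(self, user_msg: str) -> str:
        self.memory.turns.append(("user", user_msg))
        reply = self.llm.complete(user_msg)
        self.memory.turns.append(("agent", reply))
        return reply

print(Agent().step("summarize today's tasks"))
```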

Authors:Fengyu Li, Yilin Li, Junhao Zhu, Lu Chen, Yanfei Zhang, Jia Zhou, Hui Zu, Jingwen Zhao, Yunjun Gao
Title: AIstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation
Abstract:
Huawei has always been committed to exploring AI applications in historical research. Biography generation, as a specialized form of abstractive summarization, plays a crucial role in historical research but faces unique challenges that existing large language models (LLMs) struggle to address. These challenges include maintaining stylistic adherence to historical writing conventions, ensuring factual fidelity, and handling fragmented information across multiple documents. We present AIstorian, a novel end-to-end agentic system featuring knowledge graph (KG)-powered retrieval-augmented generation (RAG) and anti-hallucination multi-agents. Specifically, AIstorian introduces an in-context learning based chunking strategy and a KG-based index for accurate and efficient reference retrieval. Meanwhile, AIstorian orchestrates multi-agents to conduct on-the-fly hallucination detection and error-type-aware correction. Additionally, to teach LLMs a certain language style, we finetune LLMs based on a two-step training approach combining data augmentation-enhanced supervised fine-tuning with stylistic preference optimization. Extensive experiments on a real-life historical Jinshi dataset demonstrate that AIstorian achieves a 3.8x improvement in factual accuracy and a 47.6% reduction in hallucination rate compared to existing baselines. The data and code are available at: https://github.com/ZJU-DAILY/AIstorian.
Chinese: 华为开发的AIstorian系统通过基于知识图谱的检索增强生成与多智能体抗幻觉框架,在历史传记生成中实现了事实准确性提升3.8倍、幻觉率降低47.6%的突破性进展。
English: Huawei's AIstorian system enhances historical biography generation by employing a knowledge graph-based RAG and multi-agent anti-hallucination framework, achieving a 3.8x boost in factual accuracy and a 47.6% reduction in hallucinations compared to existing methods.

Authors:Michael Hanna, Yonatan Belinkov, Sandro Pezzelle
Title: Are formal and functional linguistic mechanisms dissociated in language models?
Abstract:
Although large language models (LLMs) are increasingly capable, these capabilities are unevenly distributed: they excel at formal linguistic tasks, such as producing fluent, grammatical text, but struggle more with functional linguistic tasks like reasoning and consistent fact retrieval. Inspired by neuroscience, recent work suggests that to succeed on both formal and functional linguistic tasks, LLMs should use different mechanisms for each; such localization could either be built-in or emerge spontaneously through training. In this paper, we ask: do current models, with fast-improving functional linguistic abilities, exhibit distinct localization of formal and functional linguistic mechanisms? We answer this by finding and comparing the "circuits", or minimal computational subgraphs, responsible for various formal and functional tasks. Comparing 5 LLMs across 10 distinct tasks, we find that while there is indeed little overlap between circuits for formal and functional tasks, there is also little overlap between circuits for different formal linguistic tasks, unlike the unified formal network observed in the human brain. Thus, a single formal linguistic network, unified and distinct from functional task circuits, remains elusive. However, in terms of cross-task faithfulness - the ability of one circuit to solve another's task - we observe a separation between formal and functional mechanisms, suggesting that shared mechanisms between formal tasks may exist.
中文: 大语言模型在形式与功能语言任务间的处理回路重叠甚少,尚未形成统一的形式网络,但跨任务忠实性表明形式任务可能存在共享机制。
English: Large language models exhibit minimal overlap between circuits for formal and functional linguistic tasks, with no unified formal network emerging, yet cross-task faithfulness suggests potential shared mechanisms for formal tasks.
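The overlap comparison at the heart of this analysis can be pictured as set arithmetic over circuit edges. Below is a toy sketch: each task's circuit is a set of edges in the model's computational graph, and pairwise intersection-over-union quantifies sharing. The edge sets are invented; real circuits come from attribution or activation-patching methods.

```python
# Toy circuits: edges between attention heads / MLPs in the compute graph.
def iou(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

circuits = {
    "agreement":   {("L3.H2", "L5.H1"), ("L5.H1", "L7.MLP"), ("L1.H0", "L3.H2")},
    "tense":       {("L2.H4", "L5.H1"), ("L5.H1", "L7.MLP")},
    "fact_recall": {("L9.H6", "L11.MLP"), ("L8.H3", "L9.H6")},
}

for t1 in circuits:
    for t2 in circuits:
        if t1 < t2:   # each unordered pair once
            print(f"{t1} vs {t2}: IoU = {iou(circuits[t1], circuits[t2]):.2f}")
```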

Authors:Yuanshuo Zhang, Yuchen Hou, Bohan Tang, Shuo Chen, Muhan Zhang, Xiaowen Dong, Siheng Chen
Title: GNNs as Predictors of Agentic Workflow Performances
Abstract:
Agentic workflows invoked by Large Language Models (LLMs) have achieved remarkable success in handling complex tasks. However, optimizing such workflows is costly and inefficient in real-world applications due to extensive invocations of LLMs. To fill this gap, this position paper formulates agentic workflows as computational graphs and advocates Graph Neural Networks (GNNs) as efficient predictors of agentic workflow performances, avoiding repeated LLM invocations for evaluation. To empirically ground this position, we construct FLORA-Bench, a unified platform for benchmarking GNNs for predicting agentic workflow performances. With extensive experiments, we arrive at the following conclusion: GNNs are simple yet effective predictors. This conclusion supports new applications of GNNs and a novel direction towards automating agentic workflow optimization. All codes, models, and data are available at https://github.com/youngsoul0731/Flora-Bench.
中文: 本文提出将智能体工作流建模为计算图,并利用图神经网络(GNN)作为高效性能预测器,通过FLORA-Bench平台验证了该方法能有效避免重复调用大语言模型,为工作流优化开辟了新方向。
English: This paper proposes using Graph Neural Networks (GNNs) to efficiently predict the performance of agentic workflows modeled as computational graphs, avoiding costly repeated LLM invocations and demonstrating their effectiveness through the FLORA-Bench platform.
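A minimal sketch of the position, assuming a workflow is encoded as a graph of per-step feature vectors: one GCN-style propagation layer plus a mean-pool readout yields a success probability without invoking any LLM. The weights are random placeholders for a predictor that would be trained on FLORA-Bench-style performance records.

```python
import numpy as np

rng = np.random.default_rng(0)

# A workflow as a computational graph: nodes are agent/LLM steps with
# feature vectors (e.g., prompt/role embeddings), edges are invocations.
X = rng.normal(size=(5, 8))           # 5 nodes, 8-dim features
A = np.zeros((5, 5))
for src, dst in [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]:
    A[src, dst] = A[dst, src] = 1.0   # symmetrized adjacency
A += np.eye(5)                        # self-loops
D_inv = np.diag(1.0 / A.sum(1))

W1, W2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 1))

H = np.maximum(D_inv @ A @ X @ W1, 0.0)      # one GCN-style propagation layer
score = 1 / (1 + np.exp(-(H.mean(0) @ W2)))  # mean-pool readout -> success prob.
print(f"predicted workflow success probability: {score.item():.3f}")
```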

Authors:Sahil Kale, Vijaykant Nadadur
Title: Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries
Abstract:
As LLMs grow more powerful, their most profound achievement may be recognising when to say "I don't know". Existing studies on LLM self-knowledge have been largely constrained by human-defined notions of feasibility, often neglecting the reasons behind unanswerability by LLMs and failing to study deficient types of self-knowledge. This study aims to obtain intrinsic insights into different types of LLM self-knowledge with a novel methodology: allowing them the flexibility to set their own feasibility boundaries and then analysing the consistency of these limits. We find that even frontier models like GPT-4o and Mistral Large are not sure of their own capabilities more than 80% of the time, highlighting a significant lack of trustworthiness in responses. Our analysis of confidence balance in LLMs indicates that models swing between overconfidence and conservatism in feasibility boundaries depending on task categories and that the most significant self-knowledge weaknesses lie in temporal awareness and contextual understanding. These difficulties in contextual comprehension additionally lead models to question their operational boundaries, resulting in considerable confusion within the self-knowledge of LLMs. We make our code and results available publicly at https://github.com/knowledge-verse-ai/LLM-Self_Knowledge_Eval
中文: 研究表明即使如GPT-4o和Mistral Large等前沿大语言模型也普遍缺乏可靠的自我认知能力,超过80%的情况下无法准确判断自身能力边界,尤其在时间感知和语境理解方面存在显著缺陷。
English: This study reveals that even advanced LLMs like GPT-4o and Mistral Large lack reliable self-knowledge, being unsure of their own capabilities more than 80% of the time, with the most significant weaknesses in temporal awareness and contextual understanding, which further leads models to question their own operational boundaries.
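The consistency analysis can be sketched as repeated feasibility queries followed by a majority-agreement score. In the sketch below, `ask_model` is a deterministic stub standing in for actual LLM calls asking "can you do this task?".

```python
from collections import Counter

def ask_model(task: str, seed: int) -> str:
    # Stand-in for an LLM feasibility judgment; varies with the sampling seed.
    return "feasible" if (len(task) + seed) % 3 else "infeasible"

def consistency(task: str, n: int = 10) -> float:
    """Fraction of repeated queries that agree with the majority verdict."""
    votes = Counter(ask_model(task, s) for s in range(n))
    return votes.most_common(1)[0][1] / n

for task in ["translate to French", "predict tomorrow's stock price"]:
    print(task, "->", consistency(task))
```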

Authors:Haoyang Huang, Guoqing Ma, Nan Duan, Xing Chen, Changyi Wan, Ranchen Ming, Tianyu Wang, Bo Wang, Zhiying Lu, Aojie Li, Xianfang Zeng, Xinhao Zhang, Gang Yu, Yuhe Yin, Qiling Wu, Wen Sun, Kang An, Xin Han, Deshan Sun, Wei Ji, Bizhu Huang, Brian Li, Chenfei Wu, Guanzhe Huang, Huixin Xiong, Jiaxin He, Jianchang Wu, Jianlong Yuan, Jie Wu, Jiashuai Liu, Junjing Guo, Kaijun Tan, Liangyu Chen, Qiaohui Chen, Ran Sun, Shanshan Yuan, Shengming Yin, Sitong Liu, Wei Chen, Yaqi Dai, Yuchu Luo, Zheng Ge, Zhisheng Guan, Xiaoniu Song, Yu Zhou, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Yi Xiu, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
Title: Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
Abstract:
We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs. We build Step-Video-TI2V-Eval as a new benchmark for the text-driven image-to-video task and compare Step-Video-TI2V with open-source and commercial TI2V engines using this dataset. Experimental results demonstrate the state-of-the-art performance of Step-Video-TI2V in the image-to-video generation task. Both Step-Video-TI2V and Step-Video-TI2V-Eval are available at https://github.com/stepfun-ai/Step-Video-TI2V.
中文摘要:Step-Video-TI2V作为先进的300亿参数模型,能够根据文本和图像输入生成视频,在新基准测试中相比现有方案展现了最优性能。
English Summary: Step-Video-TI2V is a state-of-the-art 30B-parameter model that generates videos from text and image inputs, achieving top performance on a new benchmark compared to existing solutions.

Authors:Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, Jian Luan
Title: Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
Abstract:
Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks. However, the audio modality has largely been overlooked in these developments. Thus, we conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task. We apply the group relative policy optimization (GRPO) algorithm to Qwen2-Audio-7B-Instruct, and our experiments demonstrate state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of 64.5%. The main findings in this technical report are as follows: 1) The GRPO algorithm can be effectively applied to large audio language models (LALMs), even when the model has only 8.2B parameters; 2) With only 38k post-training samples, RL significantly outperforms supervised fine-tuning (SFT), indicating that RL-based approaches can be effective without large datasets; 3) The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently utilize deep thinking remains an open question for further research; 4) LALMs still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration. Our project is available at https://github.com/xiaomi-research/r1-aqa and https://huggingface.co/mispeech/r1-aqa.
中文:强化学习显著提升了大型音频语言模型的理解能力,通过GRPO算法在MMAU基准测试中取得领先性能,但在听觉推理方面仍远逊于人类水平。
English: Reinforcement learning significantly enhances audio understanding in large audio language models, achieving state-of-the-art performance on the MMAU benchmark with the GRPO algorithm, yet still lags behind human auditory reasoning.
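The group-relative advantage that GRPO substitutes for a learned value function is simple to state: sample a group of answers per question, score each with a rule-based reward (e.g., whether the AQA choice is correct), and normalize rewards within the group. A minimal sketch:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each response's reward against its own sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

group_rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]  # 1 = correct answer choice
print(grpo_advantages(group_rewards))
# Each advantage then weights the policy-gradient term for its response,
# typically with a PPO-style clipped ratio and a KL penalty to a reference model.
```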

Authors:Rachel S. Y. Teo, Tan M. Nguyen
Title: MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling
Abstract:
Large-scale pre-training of deep models, followed by fine-tuning them, has become the cornerstone of natural language processing (NLP). The prevalence of data coupled with computational resources has led to large models with a considerable number of parameters. While the massive size of these models has led to remarkable success in many NLP tasks, a detriment is the expense required to retrain all the base model's parameters for the adaptation to each task or domain. Parameter Efficient Fine-Tuning (PEFT) provides an effective solution for this challenge by minimizing the number of parameters required to be fine-tuned while maintaining the quality of the model. While existing methods have achieved impressive results, they mainly focus on adapting a subset of parameters, weight reparameterization, and prompt engineering. In this paper, we study layers as extractors of different types of linguistic information that are valuable when used in conjunction. We then propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts (SMoE) whose experts are layers in the pre-trained model. It performs a conditional computation of a mixture of layers during fine-tuning to provide the model with more structural knowledge about the data. By providing an avenue for information exchange between layers, MoLEx enables the model to make a more well-informed prediction for the downstream task, leading to better fine-tuning results with the same number of effective parameters. As experts can be processed in parallel, MoLEx introduces minimal additional computational overhead. We empirically corroborate the advantages of MoLEx when combined with popular PEFT baseline methods on a variety of downstream fine-tuning tasks, including the popular GLUE benchmark as well as the End-to-End Challenge (E2E). The code is publicly available at https://github.com/rachtsy/molex.
中文: 大规模预训练模型在任务适应时面临高昂的重训练成本,而提出的层专家混合方法通过将模型层作为专家组合,以最小计算开销增强结构知识交换,实现了高效微调。
English: Large-scale pre-trained models face high retraining costs for task adaptation, but the proposed Mixture of Layer Experts (MoLEx) method enables efficient fine-tuning by combining layers as experts to enhance structural knowledge exchange with minimal computational overhead.
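A rough sketch of the layer-expert idea (not the official implementation): at a given layer, a small router picks another pre-trained layer to run in parallel on the same input, and the two outputs are mixed. Routing here is a hard argmax for brevity; the real method trains the gate end-to-end, and the "layers" below are plain linear maps standing in for pre-trained transformer blocks.

```python
import torch
import torch.nn as nn

class MoLExBlock(nn.Module):
    def __init__(self, layers: nn.ModuleList, idx: int, d_model: int):
        super().__init__()
        self.layers, self.idx = layers, idx
        self.router = nn.Linear(d_model, len(layers))  # scores candidate layer-experts
        self.alpha = nn.Parameter(torch.tensor(0.5))   # learnable mixing weight

    def forward(self, x):                              # x: (batch, seq, d_model)
        scores = self.router(x.mean(dim=1))            # route on pooled tokens
        expert = int(scores.mean(0).argmax())          # top-1 layer expert
        main = self.layers[self.idx](x)                # this block's own layer
        aux = self.layers[expert](x)                   # expert runs in parallel
        return self.alpha * main + (1 - self.alpha) * aux

layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])
block = MoLExBlock(layers, idx=0, d_model=32)
print(block(torch.randn(2, 5, 32)).shape)   # torch.Size([2, 5, 32])
```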

Authors:Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum
Title: X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Abstract:
Multi-head latent attention (MLA) is designed to optimize KV cache memory through low-rank key-value joint compression. Rather than caching keys and values separately, MLA stores their compressed latent representations, reducing memory overhead while maintaining performance. While MLA improves memory efficiency without compromising language model accuracy, its major limitation lies in its integration during the pre-training phase, requiring models to be trained from scratch. This raises a key question: can we use MLA's benefits fully or partially in models that have already been pre-trained with different attention mechanisms? In this paper, we propose X-EcoMLA, which deploys post-training distillation to enable the upcycling of Transformer-based attention into an efficient hybrid MLA variant through lightweight post-training adaptation, bypassing the need for extensive pre-training. We demonstrate that leveraging the dark knowledge of a well-trained model can enhance training accuracy and enable extreme KV cache compression in MLA without compromising model performance. The experimental results show that our proposed method can effectively compress the KV cache while preserving performance on the benchmarks; specifically, for the Llama3.2-1B-Instruct baseline, a 6.4x compression achieves the same average score using only 3.6B training tokens and 70 GPU hours on AMD MI300, whereas a 10.6x compression has less than a 0.1% average score drop with 7B training tokens and 140 GPU hours. The code for this work is available at https://github.com/AMD-AGI/AMD-Hybrid-Models.
中文: X-EcoMLA通过轻量级后训练蒸馏,将预训练的Transformer模型升级为混合多头潜在注意力变体,在保持性能的同时实现了显著的KV缓存压缩。
English: X-EcoMLA enables efficient post-training adaptation of pre-trained Transformer models into a hybrid multi-head latent attention variant, achieving significant KV cache compression without performance loss through lightweight distillation.
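The latent-KV mechanism that MLA is built on, and that X-EcoMLA distills into already-trained models, can be illustrated in a few lines: cache only a low-rank joint latent per token, and reconstruct keys and values with up-projections at attention time. The matrices below are random placeholders for projections the method would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_latent, seq = 64, 8, 10           # 8 latent dims vs 2*64 for full K+V
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_uk = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)
W_uv = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

x = rng.normal(size=(seq, d_model))          # token hidden states
cache = x @ W_down                           # the only thing stored per token
K, V = cache @ W_uk, cache @ W_uv            # reconstructed at attention time

ratio = (2 * d_model) / d_latent
print(f"cache entries per token: {d_latent} (vs {2 * d_model}), {ratio:.0f}x smaller")
```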

Authors:Wuwei Huang, Renren Jin, Wen Zhang, Jian Luan, Bin Wang, Deyi Xiong
Title: Joint Training And Decoding for Multilingual End-to-End Simultaneous Speech Translation
Abstract:
Recent studies on end-to-end speech translation(ST) have facilitated the exploration of multilingual end-to-end ST and end-to-end simultaneous ST. In this paper, we investigate end-to-end simultaneous speech translation in a one-to-many multilingual setting which is closer to applications in real scenarios. We explore a separate decoder architecture and a unified architecture for joint synchronous training in this scenario. To further explore knowledge transfer across languages, we propose an asynchronous training strategy on the proposed unified decoder architecture. A multi-way aligned multilingual end-to-end ST dataset was curated as a benchmark testbed to evaluate our methods. Experimental results demonstrate the effectiveness of our models on the collected dataset. Our codes and data are available at: https://github.com/XiaoMi/TED-MMST.
中文摘要:本文研究多语言端到端同声传译,提出分离式与统一式解码器架构及同步与异步训练策略,并在自建多语言数据集上验证了模型有效性。
English Summary: This paper explores end-to-end simultaneous speech translation in multilingual settings, proposing separate and unified decoder architectures with joint synchronous and asynchronous training strategies, validated on a curated multilingual dataset.

Authors:Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Title: RONA: Pragmatically Diverse Image Captioning with Coherence Relations
Abstract:
Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance caption diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. We propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLM) that leverages Coherence Relations as a controllable axis for pragmatic variations. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment, compared to MLLM baselines across multiple domains. Our code is available at: https://github.com/aashish2000/RONA
中文: 传统写作助手通过句法和语义变化生成多样的图像描述,而人类描述则注重利用语用线索传达核心信息与视觉细节;为此提出的RONA策略,通过连贯关系作为可控轴,使多模态大语言模型生成的描述在多样性和真实性上优于基线模型。
English: Traditional writing assistants create diverse image captions through syntactic and semantic variations, but human captions emphasize conveying a central message with visual details using pragmatic cues, leading to the development of RONA, a prompting strategy for MLLMs that uses coherence relations to improve caption diversity and alignment with ground truth.

Authors:Gaotang Li, Yuzhong Chen, Hanghang Tong
Title: Taming Knowledge Conflicts in Language Models
Abstract:
Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between "memory heads" and "context heads", attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the superposition of contextual information and parametric memory, where highly influential attention heads simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JuICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JuICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JuICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JuICE in these settings. Our code is available at https://github.com/GaotangLi/JUICE.
中文: 本研究揭示了语言模型中注意力头能同时处理上下文与参数记忆,产生叠加效应,并提出了JuICE这一无需微调的测试时干预方法,有效解决知识冲突,在多种数据集和模型上实现了最优性能。
English: This study reveals that attention heads in language models can simultaneously handle both contextual and parametric knowledge, leading to a superposition effect, and introduces JuICE, a test-time intervention method that effectively resolves knowledge conflicts without fine-tuning, achieving state-of-the-art performance across diverse datasets and models.

Authors:Kai Zhang, Jianwei Yang, Jeevana Priya Inala, Chandan Singh, Jianfeng Gao, Yu Su, Chenglong Wang
Title: Towards Understanding Graphical Perception in Large Multimodal Models
Abstract:
Despite the promising results of large multimodal models (LMMs) in complex vision-language tasks that require knowledge, reasoning, and perception abilities together, we surprisingly found that these models struggle with simple tasks on infographics that require perception only. As existing benchmarks primarily focus on end tasks that require various abilities, they provide limited, fine-grained insights into the limitations of the models' perception abilities. To address this gap, we leverage the theory of graphical perception, an approach used to study how humans decode visual information encoded on charts and graphs, to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts. With automated task generation and response evaluation designs, our framework enables comprehensive and controlled testing of LMMs' graphical perception across diverse chart types, visual elements, and task types. We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three granularity levels (chart, visual element, and pixel). Our findings underscore several critical limitations of current state-of-the-art LMMs, including GPT-4o: their inability to (1) generalize across chart types, (2) understand fundamental visual elements, and (3) cross reference values within a chart. These insights provide guidance for future improvements in perception abilities of LMMs. The evaluation framework and labeled data are publicly available at https://github.com/microsoft/lmm-graphical-perception.
English Summary: Large multimodal models surprisingly struggle with simple perception tasks on infographics, prompting the development of an evaluation framework based on graphical perception theory that reveals critical limitations in their ability to generalize across chart types and understand visual elements.

Authors:Yafei Zhang, Murray Wang, Yu Wang, Xiaohui Wang
Title: RankPO: Preference Optimization for Job-Talent Matching
Abstract:
Matching job descriptions (JDs) with suitable talent requires models capable of understanding not only textual similarities between JDs and candidate resumes but also contextual factors such as geographical location and academic seniority. To address this challenge, we propose a two-stage training framework for large language models (LLMs). In the first stage, a contrastive learning approach is used to train the model on a dataset constructed from real-world matching rules, such as geographical alignment and research area overlap. While effective, this model primarily learns patterns defined by the matching rules. In the second stage, we introduce a novel preference-based fine-tuning method inspired by Direct Preference Optimization (DPO), termed Rank Preference Optimization (RankPO), to align the model with AI-curated pairwise preferences emphasizing textual understanding. Our experiments show that while the first-stage model achieves strong performance on rule-based data (nDCG@20 = 0.706), it lacks robust textual understanding (alignment with AI annotations = 0.46). By fine-tuning with RankPO, we achieve a balanced model that retains relatively good performance on the original tasks while significantly improving alignment with AI preferences. The code and data are available at https://github.com/yflyzhang/RankPO.
中文: 本文提出一个两阶段训练框架,先通过基于现实匹配规则的对比学习训练大语言模型,再采用新型排序偏好优化方法强化文本理解能力,最终获得既能保持规则匹配性能、又显著提升与AI偏好对齐度的平衡模型。
English: This paper introduces a two-stage training framework for large language models that first uses contrastive learning based on real-world matching rules and then applies a novel Rank Preference Optimization method to enhance textual understanding, resulting in a balanced model that maintains rule-based performance while significantly improving alignment with AI preferences.
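Since RankPO is inspired by DPO, its core objective can be sketched as a pairwise logistic loss that widens the score margin between the AI-preferred and the rejected candidate. This is an illustrative simplification; the dummy scores below stand in for the model's scores on (JD, candidate) pairs.

```python
import torch
import torch.nn.functional as F

def rank_po_loss(score_chosen, score_rejected, beta=1.0):
    """DPO-style pairwise objective: maximize the preferred-vs-rejected margin."""
    return -F.logsigmoid(beta * (score_chosen - score_rejected)).mean()

chosen = torch.tensor([2.1, 0.3, 1.5])     # scores of preferred matches
rejected = torch.tensor([1.0, 0.9, -0.2])  # scores of dispreferred matches
print(rank_po_loss(chosen, rejected))      # loss shrinks as the margin grows
```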

Authors:Xin Liu, Pei Liu, Guoming Tang
Title: ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs
Abstract:
The linear growth of key-value (KV) cache memory and the quadratic computational complexity of attention mechanisms pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often incur irreversible information loss or require costly parameter retraining. To this end, we propose ZSMerge, a dynamic KV cache compression framework designed for efficient cache management, featuring three key operations: (1) fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) a residual merging mechanism that preserves critical context through compensated attention scoring, and (3) a zero-shot adaptation mechanism compatible with diverse LLM architectures without requiring retraining. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation across LLMs. When applied to LLaMA2-7B, it demonstrates a 20:1 compression ratio for key-value cache retention (reducing memory footprint to 5% of baseline) while sustaining comparable generation quality, coupled with triple throughput gains at extreme 54k-token contexts that eliminate out-of-memory failures. The code is available at https://github.com/SusCom-Lab/ZSMerge.
中文摘要:ZSMerge是一种动态KV缓存压缩框架,通过细粒度令牌重要性评估、残差合并和零样本自适应机制,在保持生成质量的同时实现20:1的压缩比,显著提升大语言模型的内存效率和推理速度。
English Summary: ZSMerge is a dynamic KV cache compression framework that enhances memory efficiency and inference speed in large language models through fine-grained token importance evaluation, residual merging, and zero-shot adaptation, achieving a 20:1 compression ratio with minimal performance loss.
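A toy version of the residual-merging step, under the assumption that token importance is tracked via accumulated attention mass: rather than discarding the least-important token when the cache is full, its key/value is folded into a residual slot so merged context still influences future attention. Details differ from the real ZSMerge.

```python
import numpy as np

def evict_and_merge(keys, values, importance, residual_k, residual_v, r_mass):
    """Evict the least-important token, merging it into a residual KV slot."""
    i = int(np.argmin(importance))                 # least-important token
    m = importance[i]
    total = r_mass + m
    residual_k = (r_mass * residual_k + m * keys[i]) / total   # mass-weighted merge
    residual_v = (r_mass * residual_v + m * values[i]) / total
    keep = np.arange(len(keys)) != i
    return keys[keep], values[keep], importance[keep], residual_k, residual_v, total

rng = np.random.default_rng(0)
K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
imp = rng.random(6)                                # e.g., summed attention scores
rk, rv, mass = np.zeros(4), np.zeros(4), 1e-6
K, V, imp, rk, rv, mass = evict_and_merge(K, V, imp, rk, rv, mass)
print(K.shape, "residual mass:", round(mass, 3))
```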

Authors:Zhongzhan Huang, Guoming Ling, Yupei Lin, Yandong Chen, Shanshan Zhong, Hefeng Wu, Liang Lin
Title: RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs
Abstract:
Routing large language models (LLMs) is a new paradigm that uses a router to recommend the best LLM from a pool of candidates for a given input. In this paper, our comprehensive analysis with more than 8,500 LLMs reveals a novel model-level scaling up phenomenon in Routing LLMs, i.e., a capable router can significantly enhance the performance of this paradigm as the number of candidates increases. This improvement can even surpass the performance of the best single model in the pool and many existing strong LLMs, confirming it as a highly promising paradigm. However, the lack of comprehensive and open-source benchmarks for Routing LLMs has hindered the development of routers. In this paper, we introduce RouterEval, a benchmark tailored for router research, which includes over 200,000,000 performance records for 12 popular LLM evaluations across various areas such as commonsense reasoning, semantic understanding, etc., based on over 8,500 LLMs. Using RouterEval, extensive evaluations of existing Routing LLM methods reveal that most still have significant room for improvement. See https://github.com/MilkThink-Lab/RouterEval for all data, code and tutorial.
中文: 路由大语言模型通过从候选池中选择最优模型来提升性能,新推出的RouterEval基准测试揭示了该方法仍有巨大改进空间,并为路由器的开发提供了海量数据支持。
English: Routing large language models enhances performance by selecting the best model from a pool, with the new RouterEval benchmark revealing significant improvement potential and providing extensive data for router development.

Authors:Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
Title: Transformers without Normalization
Abstract:
Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to language models. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.
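A drop-in sketch of a DyT layer following the formula above: the normalization statistics are replaced by tanh(alpha * x) with a single learnable alpha, while keeping the per-channel affine scale and shift that normalization layers typically carry (assumed here; the default alpha of 0.5 is also an assumption).

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Element-wise Dynamic Tanh: gamma * tanh(alpha * x) + beta."""
    def __init__(self, dim: int, alpha0: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha0))  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))       # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))       # per-channel shift

    def forward(self, x):
        return self.gamma * torch.tanh(self.alpha * x) + self.beta

x = torch.randn(2, 5, 64)
print(DyT(64)(x).shape)    # torch.Size([2, 5, 64]) -- same shape as LayerNorm(64)(x)
```

Unlike LayerNorm, no per-token statistics are computed, so the operation is purely element-wise.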

Authors:Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
Title: TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention
Abstract:
Object Hallucination (OH) has been acknowledged as one of the major trustworthy challenges in Large Vision-Language Models (LVLMs). Recent advancements in Large Language Models (LLMs) indicate that internal states, such as hidden states, encode the "overall truthfulness" of generated responses. However, it remains under-explored how internal states in LVLMs function and whether they could serve as "per-token" hallucination indicators, which is essential for mitigating OH. In this paper, we first conduct an in-depth exploration of LVLM internal states in relation to OH issues and discover that (1) LVLM internal states are high-specificity per-token indicators of hallucination behaviors. Moreover, (2) different LVLMs encode universal patterns of hallucinations in common latent subspaces, indicating that there exist "generic truthful directions" shared by various LVLMs. Based on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt) that first learns the truthful direction of LVLM decoding and then applies truthful-guided inference-time intervention during LVLM decoding. We further propose ComnHallu to enhance both cross-LVLM and cross-data hallucination detection transferability by constructing and aligning hallucination latent subspaces. We evaluate TruthPrInt in extensive experimental settings, including in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks. Experimental results indicate that TruthPrInt significantly outperforms state-of-the-art methods. Codes will be available at https://github.com/jinhaoduan/TruthPrInt.
Chinese: 物体幻觉是大视觉语言模型中的主要可信挑战,但内部状态可作为逐令牌指示器并揭示通用模式,由此提出的TruthPrInt方法通过真实性引导干预显著优于现有方法。
English: Object Hallucination is a major trust issue in Large Vision-Language Models, but internal states can serve as per-token indicators and reveal universal patterns, leading to the proposed TruthPrInt method that significantly outperforms existing approaches through truthful-guided intervention.

Authors:Florian Eichin, Yang Janet Liu, Barbara Plank, Michael A. Hedderich
Title: Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set
Abstract:
Discourse understanding is essential for many NLP tasks, yet most existing work remains constrained by framework-dependent discourse representations. This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. We address this question along two dimensions: (1) developing a unified discourse relation label set to facilitate cross-lingual and cross-framework discourse analysis, and (2) probing LLMs to assess whether they encode generalizable discourse abstractions. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks. Further layer-wise analyses reveal that language generalization at the discourse level is most salient in the intermediate layers. Lastly, our error analysis provides an account of challenging relation classes.
Chinese: 本研究探讨大型语言模型如何跨语言和框架捕获可泛化的话语知识,发现多语言模型在跨语言话语关系分类中表现优异,其中间层展现出最强的泛化能力。
English: This study explores how large language models (LLMs) capture generalizable discourse knowledge across languages and frameworks, finding that multilingual models excel in cross-lingual discourse relation classification with intermediate layers showing the strongest generalization.
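Layer-wise probing of the kind used here is easy to sketch: fit a linear classifier on each layer's hidden states to predict the unified relation label and watch where accuracy peaks. The activations below are synthetic, with label signal injected to peak mid-stack purely for illustration (scikit-learn is assumed available).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

n, d, n_layers = 300, 32, 7
labels = rng.integers(0, 4, size=n)                   # 4 unified relation classes
for layer in range(n_layers):
    strength = max(0.0, 2.0 - abs(layer - 3))         # synthetic signal, peaks at layer 3
    H = rng.normal(size=(n, d))                       # stand-in for real activations
    H[:, 0] += strength * labels                      # inject label-correlated signal
    acc = cross_val_score(LogisticRegression(max_iter=500), H, labels, cv=3).mean()
    print(f"layer {layer}: probe accuracy = {acc:.2f}")
```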

Authors:Liang Wen, Yunke Cai, Fenrui Xiao, Xin He, Qi An, Zhenyu Duan, Yimin Du, Junchen Liu, Lifu Tang, Xiaowei Lv, Haosheng Zou, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang
Title: Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
Abstract:
This paper introduces Light-R1, an open-source suite for training long reasoning models using reproducible and cost-effective methodology. Given the proprietary nature of data used in the DeepSeek-R1 series, we develop an alternative approach leveraging exclusively public data and models. Our curriculum training progressively increases data difficulty, combined with multi-staged post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO on long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 & 25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.
中文: 本文介绍Light-R1开源框架,它采用经济高效的课程学习方法和公开数据训练长推理模型,在数学推理上达到先进水平并展现强大跨领域泛化能力。
English: This paper presents Light-R1, an open-source framework for training long reasoning models using a cost-effective curriculum approach with public data, achieving state-of-the-art performance in math reasoning and strong cross-domain generalization.

Authors:Wenhao Hu, Jinhao Duan, Chunchen Wei, Li Zhang, Yue Zhang, Kaidi Xu
Title: DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation
Abstract:
The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode's ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code. Our benchmark and evaluation code are available at https://github.com/HWH-2000/DynaCode.
中文: DynaCode 是一种动态、复杂度感知的代码基准,通过结合代码复杂度和调用图结构来系统评估大语言模型,有效克服静态数据集的局限性,揭示了模型性能显著下降并深入解析其行为特征。
English: DynaCode is a dynamic, complexity-aware benchmark designed to address the limitations of static code datasets by evaluating large language models using code complexity and call-graph structures, revealing significant performance drops and providing insights into model behavior.
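A toy generator in the spirit of the benchmark: compose atomic unit functions into a nested call chain, so every sampled problem is structurally new and its ground truth comes from executing the composed program. The unit pool and the chain-shaped call graph here are simplified stand-ins for DynaCode's four complexity units and 16 call-graph types.

```python
import random

UNITS = [
    ("double", "def double(x):\n    return 2 * x"),
    ("square", "def square(x):\n    return x * x"),
    ("negate", "def negate(x):\n    return -x"),
]

def make_problem(depth: int, seed: int) -> str:
    """Sample `depth` units and nest them into one target function."""
    rng = random.Random(seed)
    chain = [rng.choice(UNITS) for _ in range(depth)]
    body = "\n\n".join(src for _, src in chain)
    call = "x"
    for name, _ in chain:
        call = f"{name}({call})"                 # nested composition, e.g. f(g(x))
    return f"{body}\n\ndef solve(x):\n    return {call}"

problem = make_problem(depth=3, seed=42)
ns: dict = {}
exec(problem, ns)                                # ground truth by execution
print(problem, "\n# solve(2) =", ns["solve"](2))
```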

Authors:Maxim Popov, Regina Kurkova, Mikhail Iumanov, Jaafar Mahmoud, Sergey Kolyubin
Title: OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Abstract:
Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a Scene Graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, forming future research directions for developing resilient and adaptable robotic systems. Project page is available at https://be2rlab.github.io/OSMa-Bench/.

Authors:Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang
Title: VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Abstract:
We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies. Specifically, our model improves the reasoning performance of three types of MLLMs and four different model scales. Even when applied to the highly capable InternVL2.5-78B, it achieves a 5.9-point improvement across seven multimodal reasoning benchmarks. Experimental results show that our model exhibits superior performance compared to Outcome Reward Models and Self-Consistency during BoN evaluation. To facilitate the training of multimodal PRMs, we construct a multimodal process supervision dataset VisualPRM400K using an automated data pipeline. For the evaluation of multimodal PRMs, we propose VisualProcessBench, a benchmark with human-annotated step-wise correctness labels, to measure the abilities of PRMs to detect erroneous steps in multimodal reasoning tasks. We hope that our work can inspire more future research and contribute to the development of MLLMs. Our model, data, and benchmark are released in https://internvl.github.io/blog/2025-03-13-VisualPRM/.
中文: VisualPRM是一个拥有80亿参数的多模态过程奖励模型,通过最佳N选评估策略提升不同多模态大语言模型的推理能力,并构建了新数据集和基准以推动该领域发展。
English: VisualPRM is an 8-billion-parameter multimodal Process Reward Model that enhances reasoning across various MLLMs through Best-of-N evaluation, achieving significant performance gains and introducing a new dataset and benchmark for future research.
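Best-of-N selection with a process reward model reduces to scoring every step of each candidate solution and picking the best aggregate. The sketch below uses a mean aggregator (min and product are common alternatives) over made-up step scores; a real PRM would produce these from the image, question, and reasoning steps.

```python
candidates = {
    "solution_A": [0.9, 0.8, 0.7],   # per-step correctness scores from the PRM
    "solution_B": [0.9, 0.2, 0.8],   # one bad step drags the chain down
    "solution_C": [0.6, 0.6, 0.6],
}

def aggregate(step_scores):
    return sum(step_scores) / len(step_scores)   # mean over reasoning steps

best = max(candidates, key=lambda c: aggregate(candidates[c]))
print("selected:", best)   # solution_A
```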

Authors:Julian Schelb, Orr Borin, David Garcia, Andreas Spitz
Title: R.U.Psycho? Robust Unified Psychometric Testing of Language Models
Abstract:
Generative language models are increasingly being subjected to psychometric questionnaires intended for human testing, in efforts to establish their traits, as benchmarks for alignment, or to simulate participants in social science experiments. While this growing body of work sheds light on the likeness of model responses to those of humans, concerns are warranted regarding the rigour and reproducibility with which these experiments may be conducted. Instabilities in model outputs, sensitivity to prompt design, parameter settings, and a large number of available model versions increase documentation requirements. Consequently, generalization of findings is often complex and reproducibility is far from guaranteed. In this paper, we present R.U.Psycho, a framework for designing and running robust and reproducible psychometric experiments on generative language models that requires limited coding expertise. We demonstrate the capability of our framework on a variety of psychometric questionnaires, which lend support to prior findings in the literature. R.U.Psycho is available as a Python package at https://github.com/julianschelb/rupsycho.
中文摘要:本文提出了R.U.Psycho框架,旨在提高生成语言模型心理测量实验的稳健性和可重复性,通过多种问卷验证了先前研究结果,并解决了输出不稳定性和提示敏感性等挑战。
English Summary: The paper introduces R.U.Psycho, a framework designed to enhance the robustness and reproducibility of psychometric experiments on generative language models, addressing challenges like output instability and prompt sensitivity while validating prior findings through various questionnaires.

Authors:Jinze Li, Yixing Xu, Haiduo Huang, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum
Title: Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding
Abstract:
Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single-generation paradigm, either serial or parallel. To address this, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness.
中文: Gumiho是一种混合推测解码模型,结合了串行与并行头部,通过为早期令牌使用复杂Transformer提升准确性,后期令牌采用轻量级MLP提高效率,从而在性能上超越现有方法。
English: Gumiho is a hybrid speculative decoding model that combines serial and parallel heads, using a sophisticated Transformer for early tokens to boost accuracy and lightweight MLPs for later tokens to enhance efficiency, outperforming existing methods.

Authors:Chunyi Li, Xiaozhe Li, Zicheng Zhang, Yuan Tian, Ziheng Jia, Xiaohong Liu, Xiongkuo Min, Jia Wang, Haodong Duan, Kai Chen, Guangtao Zhai
Title: Information Density Principle for MLLM Benchmarks
Abstract:
With the emergence of Multimodal Large Language Models (MLLMs), hundreds of benchmarks have been developed to ensure the reliability of MLLMs in downstream tasks. However, the evaluation mechanism itself may not be reliable. For developers of MLLMs, questions remain about which benchmark to use and whether the test results meet their requirements. Therefore, we propose a critical principle of Information Density, which examines how much insight a benchmark can provide for the development of MLLMs. We characterize it from four key dimensions: (1) Fallacy, (2) Difficulty, (3) Redundancy, (4) Diversity. Through a comprehensive analysis of more than 10,000 samples, we measured the information density of 19 MLLM benchmarks. Experiments show that using the latest benchmarks in testing can provide more insight compared to previous ones, but there is still room for improvement in their information density. We hope this principle can promote the development and application of future MLLM benchmarks. Project page: https://github.com/lcysyzxdxc/bench4bench
Chinese: 本研究提出信息密度原则,从四个维度评估多模态大语言模型基准的有效性,分析了19个基准后发现,尽管新基准提供更多洞见,但其信息密度仍有提升空间。
English: The study introduces the principle of Information Density to evaluate the effectiveness of Multimodal Large Language Model benchmarks, analyzing 19 benchmarks across four dimensions and finding that while newer ones offer more insights, their information density still needs enhancement.

Authors:Zhenyu Liu, Dongfang Li, Xinshuo Hu, Xinping Zhao, Yibin Chen, Baotian Hu, Min Zhang
Title: Take Off the Training Wheels Progressive In-Context Learning for Effective Alignment
Abstract:
Recent studies have explored the working mechanisms of In-Context Learning (ICL). However, they mainly focus on classification and simple generation tasks, limiting their broader application to more complex generation tasks in practice. To address this gap, we investigate the impact of demonstrations on token representations within practical alignment tasks. We find that the transformer embeds the task function learned from demonstrations into the separator token representation, which plays an important role in the generation of prior response tokens. Once the prior response tokens are determined, the demonstrations become redundant. Motivated by this finding, we propose an efficient Progressive In-Context Alignment (PICA) method consisting of two stages. In the first few-shot stage, the model generates several prior response tokens via standard ICL while concurrently extracting the ICL vector that stores the task function from the separator token representation. In the following zero-shot stage, this ICL vector guides the model to generate responses without further demonstrations. Extensive experiments demonstrate that our PICA not only surpasses vanilla ICL but also achieves comparable performance to other alignment tuning methods. The proposed training-free method reduces the time cost (e.g., 5.45+) with improved alignment performance (e.g., 6.57+). Consequently, our work highlights the application of ICL for alignment and calls for a deeper understanding of ICL for complex generations. The code will be available at https://github.com/HITsz-TMG/PICA.
Chinese: 本研究提出渐进式上下文对齐方法(PICA),通过两阶段设计先从示例中提取任务函数再用于零样本生成,在降低时间成本的同时实现了比标准上下文学习更优的对齐性能。
English: This study introduces Progressive In-Context Alignment (PICA), a two-stage method that first extracts task functions from demonstrations and then uses them for zero-shot generation, achieving superior alignment performance while reducing time costs compared to standard in-context learning.

Authors:Allison Andreyev
Title: Quantization for OpenAI's Whisper Models: A Comparative Analysis
Abstract:
Automated speech recognition (ASR) models have gained prominence for applications such as captioning, speech translation, and live transcription. This paper studies Whisper and two model variants: one optimized for live speech streaming and another for offline transcription. Notably, these models have been found to generate hallucinated content, reducing transcription reliability. Furthermore, larger model variants exhibit increased latency and pose challenges for deployment on resource-constrained devices. This study analyzes the similarities and differences between three Whisper models, qualitatively examining their distinct capabilities. Next, this study quantifies the impact of model quantization on latency and evaluates its viability for edge deployment. Using the open source LibriSpeech dataset, this paper evaluates the word error rate (WER) along with latency analysis of whispercpp using 3 quantization methods (INT4, INT5, INT8). Results show that quantization reduces latency by 19% and model size by 45%, while preserving transcription accuracy. These findings provide insights into the optimal use cases of different Whisper models and edge device deployment possibilities. All code, datasets, and implementation details are available in a public GitHub repository: https://github.com/allisonandreyev/WhisperQuantization.git
中文: 本研究分析了三种Whisper语音识别模型,发现量化技术在保持精度的同时显著降低了延迟和模型体积,为边缘设备部署提供了可行方案。
English: This study analyzes three Whisper ASR models, revealing that quantization techniques significantly reduce latency and model size while maintaining accuracy, offering practical solutions for edge device deployment.
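The WER metric used in this evaluation is worth spelling out: the word-level Levenshtein distance (substitutions plus insertions plus deletions) divided by the reference length. A dependency-free implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words
```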

Authors:Abhipsha Das, Nicholas Lourie, Siavash Golkar, Mariel Pettee
Title: What's In Your Field? Mapping Scientific Research with Knowledge Graphs and Large Language Models
Abstract:
The scientific literature's exponential growth makes it increasingly challenging to navigate and synthesize knowledge across disciplines. Large language models (LLMs) are powerful tools for understanding scientific text, but they fail to capture detailed relationships across large bodies of work. Unstructured approaches, like retrieval augmented generation, can sift through such corpora to recall relevant facts; however, when millions of facts influence the answer, unstructured approaches become cost prohibitive. Structured representations offer a natural complement -- enabling systematic analysis across the whole corpus. Recent work enhances LLMs with unstructured or semistructured representations of scientific concepts; to complement this, we try extracting structured representations using LLMs. By combining LLMs' semantic understanding with a schema of scientific concepts, we prototype a system that answers precise questions about the literature as a whole. Our schema applies across scientific fields and we extract concepts from it using only 20 manually annotated abstracts. To demonstrate the system, we extract concepts from 30,000 papers on arXiv spanning astrophysics, fluid dynamics, and evolutionary biology. The resulting database highlights emerging trends and, by visualizing the knowledge graph, offers new ways to explore the ever-growing landscape of scientific knowledge. Demo: abby101/surveyor-0 on HF Spaces. Code: https://github.com/chiral-carbon/kg-for-science.
中文摘要:科学文献的爆炸式增长使跨学科知识整合愈发困难,本研究通过将大语言模型与结构化概念框架相结合,构建了一个能够从数万篇论文中提取精确关联并揭示新兴趋势的知识图谱系统。
English Summary: The exponential growth of scientific literature challenges knowledge synthesis, but this work introduces a system combining large language models with structured concept schemas to extract precise relationships and emerging trends from thousands of papers across multiple disciplines.

Authors:Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, Wanxiang Che
Title: Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
Abstract:
Recent advancements in reasoning with large language models (RLLMs), such as OpenAI-O1 and DeepSeek-R1, have demonstrated their impressive capabilities in complex domains like mathematics and coding. A central factor in their success lies in the application of long chain-of-thought (Long CoT) characteristics, which enhance reasoning abilities and enable the solution of intricate problems. However, despite these developments, a comprehensive survey on Long CoT is still lacking, limiting our understanding of its distinctions from traditional short chain-of-thought (Short CoT) and complicating ongoing debates on issues like "overthinking" and "inference-time scaling." This survey seeks to fill this gap by offering a unified perspective on Long CoT. (1) We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. (2) Next, we explore the key characteristics of Long CoT: deep reasoning, extensive exploration, and feasible reflection, which enable models to handle more complex tasks and produce more efficient, coherent outcomes compared to the shallower Short CoT. (3) We then investigate key phenomena such as the emergence of Long CoT with these characteristics, including overthinking, and inference-time scaling, offering insights into how these processes manifest in practice. (4) Finally, we identify significant research gaps and highlight promising future directions, including the integration of multi-modal reasoning, efficiency improvements, and enhanced knowledge frameworks. By providing a structured overview, this survey aims to inspire future research and further the development of logical reasoning in artificial intelligence.
中文: 近期大语言模型推理能力的进步依赖于长思维链解决复杂任务,本综述通过区分长短思维链、探讨其关键特性、过度思考等现象及未来方向,填补研究空白,推动人工智能推理发展。
English: Recent advances in reasoning with large language models (RLLMs) leverage long chain-of-thought (Long CoT) to solve complex tasks, and this survey addresses the lack of comprehensive research by distinguishing Long CoT from short CoT, exploring its key traits, phenomena like overthinking, and future directions to advance AI reasoning.

Authors:Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, Jiawei Han
Title: Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Abstract:
Efficiently acquiring external knowledge and up-to-date information is essential for effective reasoning and text generation in large language models (LLMs). Prompting advanced LLMs with reasoning capabilities to use search engines during inference is often suboptimal, as the LLM might not fully possess the capability to interact optimally with the search engine. This paper introduces Search-R1, an extension of reinforcement learning (RL) for reasoning frameworks where the LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval. Search-R1 optimizes LLM reasoning trajectories with multi-turn search interactions, leveraging retrieved token masking for stable RL training and a simple outcome-based reward function. Experiments on seven question-answering datasets show that Search-R1 improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines under the same setting. This paper further provides empirical insights into RL optimization methods, LLM choices, and response length dynamics in retrieval-augmented reasoning. The code and model checkpoints are available at https://github.com/PeterGriffinJin/Search-R1.
中文: 本文提出Search-R1强化学习框架,使大语言模型能在逐步推理过程中自主生成搜索查询,在多个数据集上相比检索增强生成基线实现了最高达41%的性能提升。
English: This paper presents Search-R1, a reinforcement learning framework that enables large language models to autonomously generate search queries during reasoning, achieving significant performance improvements of up to 41% over retrieval-augmented generation baselines across multiple datasets.
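To make the retrieved-token-masking idea concrete, here is a minimal PyTorch sketch (an illustrative reading, not code from the Search-R1 repository): tokens copied in from search results are zeroed out of the policy-gradient loss so the RL update flows only through model-generated tokens.

```python
import torch

def masked_policy_loss(logprobs, advantages, retrieved_mask):
    """logprobs, advantages: (batch, seq) tensors; retrieved_mask is 1 where a
    token was pasted in from a search result and must not receive gradient."""
    gen_mask = 1.0 - retrieved_mask.float()
    loss = -(logprobs * advantages * gen_mask)
    # Normalize by the number of model-generated tokens only.
    return loss.sum() / gen_mask.sum().clamp(min=1.0)

# Toy usage: tokens 3-5 came from a retrieved information span.
logprobs = torch.randn(2, 8, requires_grad=True)
advantages = torch.ones(2, 8)  # outcome-based reward broadcast over tokens
retrieved = torch.zeros(2, 8)
retrieved[:, 3:6] = 1
masked_policy_loss(logprobs, advantages, retrieved).backward()
```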

Authors:Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen
Title: ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning
Abstract:
Recent research on Reasoning of Large Language Models (LLMs) has sought to further enhance their performance by integrating meta-thinking -- enabling models to monitor, evaluate, and control their reasoning processes for more adaptive and effective problem-solving. However, current single-agent work lacks a specialized design for acquiring meta-thinking, resulting in low efficacy. To address this challenge, we introduce Reinforced Meta-thinking Agents (ReMA), a novel framework that leverages Multi-Agent Reinforcement Learning (MARL) to elicit meta-thinking behaviors, encouraging LLMs to think about thinking. ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Through iterative reinforcement learning with aligned objectives, these agents explore and learn collaboration, leading to improved generalization and robustness. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Additionally, we further extend ReMA to multi-turn interaction settings, leveraging turn-level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs. Our code can be found in https://github.com/ziyuwan/ReMA-public
中文: ReMA框架通过多智能体强化学习将元思考与推理执行分离,有效提升大语言模型在复杂任务中的表现,并通过分层协作机制增强泛化能力。
English: The ReMA framework introduces a multi-agent reinforcement learning approach to enhance large language models' reasoning by decoupling meta-thinking and execution, significantly improving performance on complex tasks.

Authors:Jiushen Cai, Weihang Zhang, Hanruo Liu, Ningli Wang, Huiqi Li
Title: RetSTA: An LLM-Based Approach for Standardizing Clinical Fundus Image Reports
Abstract:
Standardization of clinical reports is crucial for improving the quality of healthcare and facilitating data integration. The lack of unified standards, including format, terminology, and style, is a great challenge in clinical fundus diagnostic reports, which increases the difficulty for large language models (LLMs) to understand the data. To address this, we construct a bilingual standard terminology, containing fundus clinical terms and commonly used descriptions in clinical diagnosis. Then, we establish two models, RetSTA-7B-Zero and RetSTA-7B. RetSTA-7B-Zero, fine-tuned on an augmented dataset simulating clinical scenarios, demonstrates powerful standardization behaviors. However, its coverage of a wider range of diseases remains limited. To further enhance standardization performance, we build RetSTA-7B, which integrates a substantial amount of standardized data generated by RetSTA-7B-Zero along with corresponding English data, covering diverse complex clinical scenarios and achieving report-level standardization for the first time. Experimental results demonstrate that RetSTA-7B outperforms other LLMs in the bilingual standardization task, which validates its superior performance and generalizability. The checkpoints are available at https://github.com/AB-Story/RetSTA-7B.
Chinese: 本研究针对临床眼底报告缺乏标准化的问题,开发了双语模型RetSTA-7B,该模型通过整合标准化数据,在报告级标准化任务中表现出优于其他大语言模型的性能。
English: This study addresses the lack of standardization in clinical fundus reports by developing RetSTA-7B, a bilingual model that integrates standardized data to achieve superior performance in report-level standardization compared to other large language models.

Authors:Zhoutong Ye, Mingze Sun, Huan-ang Gao, Chun Yu, Yuanchun Shi
Title: MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding
Abstract:
Large multimodal models (LMMs) have demonstrated significant potential as generalists in vision-language (VL) tasks. However, there remains a significant gap between state-of-the-art LMMs and human performance when it comes to complex tasks that require a combination of fundamental VL capabilities, as well as tasks involving the grounding of complex instructions. To thoroughly investigate the human-LMM gap and its underlying causes, we propose MOAT, a diverse benchmark with complex real-world VL tasks that are challenging for LMMs. Specifically, the tasks in MOAT require LMMs to engage in generalist problem solving by integrating fundamental VL capabilities such as reading text, counting, understanding spatial relations, grounding textual and visual instructions, etc. All these abilities fit into a taxonomy proposed by us that contains 10 fundamental VL capabilities, enabling MOAT to provide a fine-grained view of LMMs' strengths and weaknesses. Besides, MOAT is the first benchmark to explicitly evaluate LMMs' ability to ground complex text and visual instructions, which is essential to many real-world applications. We evaluate over 20 proprietary and open source LMMs, as well as humans, on MOAT, and found that humans achieved 82.7% accuracy while the best performing LMM (OpenAI o1) achieved only 38.8%. To guide future model development, we analyze common trends in our results and discuss the underlying causes of observed performance gaps between LMMs and humans, focusing on which VL capability forms the bottleneck in complex tasks, whether test time scaling improves performance on MOAT, and how tiling harms LMMs' capability to count. Code and data are available at https://cambrian-yzt.github.io/MOAT.
中文: 大型多模态模型在视觉语言任务中展现出潜力,但在需要整合多种基础能力的复杂任务上仍与人类表现存在显著差距,MOAT基准测试显示人类准确率达82.7%而最佳模型仅为38.8%。
English: Large multimodal models show promise in vision-language tasks but still lag significantly behind human performance on complex tasks requiring integrated capabilities, as demonstrated by the MOAT benchmark where humans achieved 82.7% accuracy versus the top model's 38.8%.

Authors:Falko Helm, Nico Daheim, Iryna Gurevych
Title: Token Weighting for Long-Range Language Modeling
Abstract:
Many applications of large language models (LLMs) require long-context understanding, but models continue to struggle with such tasks. We hypothesize that conventional next-token prediction training could contribute to this, because each token is assigned equal weight. Yet, intuitively, the amount of context needed to predict the next token accurately varies greatly across different data. To reflect this, we propose various novel token-weighting schemes that assign different weights to each training token in the loss, thereby generalizing existing works. For this, we categorize token-weighting methods using a two-step framework which compares the confidences of a long-context and short-context model to score tokens. We evaluate all methods on multiple long-context understanding tasks and show that non-uniform loss weights are helpful to improve the long-context abilities of LLMs. Different short-context models can be used effectively for token scoring, including models that are much smaller than the long-context model that is trained. All in all, this work contributes to a better understanding of the trade-offs long-context language modeling faces and provides guidelines for model steering via loss-weighting based on empirical evidence. The code can be found on Github.
中文: 本研究提出新颖的令牌加权方案,通过在训练中分配不同的损失权重来增强大语言模型的长上下文理解能力,并通过实证验证在多任务中实现了性能提升。
English: This study proposes novel token-weighting schemes that assign varying loss weights during training to enhance large language models' long-context understanding, demonstrating improved performance across multiple tasks through empirical validation.
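A minimal sketch of the two-step scoring framework described above, under assumed tensor shapes (the paper's exact weighting schemes may differ): a short-context model scores each target token, and the confidence gap to the long-context model becomes a non-uniform loss weight.

```python
import torch
import torch.nn.functional as F

def token_weights(long_logits, short_logits, targets):
    """Weight tokens by how much the long-context model's confidence exceeds
    a short-context scorer's; logits: (B, T, V), targets: (B, T)."""
    lp_long = F.log_softmax(long_logits, -1).gather(-1, targets[..., None]).squeeze(-1)
    lp_short = F.log_softmax(short_logits, -1).gather(-1, targets[..., None]).squeeze(-1)
    gap = (lp_long - lp_short).clamp(min=0.0)  # tokens that benefit from long context
    return 1.0 + gap                            # reduces to uniform weighting at gap 0

def weighted_nll(long_logits, targets, weights):
    nll = F.cross_entropy(long_logits.transpose(1, 2), targets, reduction="none")
    return (weights * nll).sum() / weights.sum()

B, T, V = 2, 16, 100
long_logits, short_logits = torch.randn(B, T, V), torch.randn(B, T, V)
targets = torch.randint(V, (B, T))
print(weighted_nll(long_logits, targets, token_weights(long_logits, short_logits, targets)))
```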

Authors:Zihao Chen, Hisashi Handa, Miho Ohsaki, Kimiaki Shirahama
Title: Domain Adaptation for Japanese Sentence Embeddings with Contrastive Learning based on Synthetic Sentence Generation
Abstract:
Several backbone models pre-trained on general domain datasets can encode a sentence into a widely useful embedding. Such sentence embeddings can be further enhanced by domain adaptation that adapts a backbone model to a specific domain. However, domain adaptation for low-resource languages like Japanese is often difficult due to the scarcity of large-scale labeled datasets. To overcome this, this paper introduces SDJC (Self-supervised Domain adaptation for Japanese sentence embeddings with Contrastive learning), which utilizes a data generator to generate sentences that have the same syntactic structure as a sentence in an unlabeled domain-specific corpus but convey different semantic meanings. Generated sentences are then used to boost contrastive learning, which adapts a backbone model to accurately discriminate sentences in the specific domain. In addition, the components of SDJC, like the backbone model and the method to adapt it, need to be carefully selected, but no benchmark dataset is available for Japanese. Thus, a comprehensive Japanese STS (Semantic Textual Similarity) benchmark dataset is constructed by combining datasets machine-translated from English with existing datasets. The experimental results validate the effectiveness of SDJC on two domain-specific downstream tasks as well as the usefulness of the constructed dataset. Datasets, codes and backbone models adapted by SDJC are available on our github repository https://github.com/ccilab-doshisha/SDJC.
中文摘要:本文提出SDJC方法,通过对比学习和合成数据生成实现日语领域句子嵌入的自监督适应,并构建了基准数据集来评估其有效性。
English Summary: This paper presents SDJC, a self-supervised method using contrastive learning and synthetic data generation to adapt sentence embedding models for Japanese domains, while also creating a benchmark dataset to evaluate these adaptations.
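The contrastive objective can be pictured as an InfoNCE loss in which the generated same-syntax, different-meaning sentences act as hard negatives; the following sketch is schematic, with assumed embedding shapes, not the repository's code.

```python
import torch
import torch.nn.functional as F

def sdjc_style_loss(anchor, positive, synth_negatives, tau=0.05):
    """anchor, positive: (B, D); synth_negatives: (B, K, D) embeddings of
    generated sentences sharing the anchor's syntax but differing in meaning."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    n = F.normalize(synth_negatives, dim=-1)
    pos = (a * p).sum(-1, keepdim=True) / tau         # (B, 1)
    neg = torch.einsum("bd,bkd->bk", a, n) / tau      # (B, K)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(a.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)

print(sdjc_style_loss(torch.randn(4, 32), torch.randn(4, 32), torch.randn(4, 7, 32)))
```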

Authors:Zhaoling Chen, Xiangru Tang, Gangda Deng, Fang Wu, Jialong Wu, Zhiwei Jiang, Viktor Prasanna, Arman Cohan, Xingyao Wang
Title: LocAgent: Graph-Guided LLM Agents for Code Localization
Abstract:
Code localization--identifying precisely where in a codebase changes need to be made--is a fundamental yet challenging task in software maintenance. Existing approaches struggle to efficiently navigate complex codebases when identifying relevant code sections. The challenge lies in bridging natural language problem descriptions with the appropriate code elements, often requiring reasoning across hierarchical structures and multiple dependencies. We introduce LocAgent, a framework that addresses code localization through graph-based representation. By parsing codebases into directed heterogeneous graphs, LocAgent creates a lightweight representation that captures code structures (files, classes, functions) and their dependencies (imports, invocations, inheritance), enabling LLM agents to effectively search and locate relevant entities through powerful multi-hop reasoning. Experimental results on real-world benchmarks demonstrate that our approach significantly enhances accuracy in code localization. Notably, our method with the fine-tuned Qwen-2.5-Coder-Instruct-32B model achieves comparable results to SOTA proprietary models at greatly reduced cost (approximately 86% reduction), reaching up to 92.7% accuracy on file-level localization while improving downstream GitHub issue resolution success rates by 12% for multiple attempts (Pass@10). Our code is available at https://github.com/gersteinlab/LocAgent.
中文: LocAgent通过基于图的框架将自然语言问题描述与代码元素精准关联,利用多跳推理实现高效的代码定位,在显著降低成本的同时大幅提升了定位准确率。
English: LocAgent introduces a graph-based framework that efficiently maps natural language problem descriptions to relevant code sections through multi-hop reasoning, achieving high accuracy in code localization with significantly reduced costs.
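A toy networkx example of the kind of directed heterogeneous graph described above (the node and edge schema here is an assumption for illustration, not the repository's actual format):

```python
import networkx as nx

g = nx.MultiDiGraph()
for node, kind in [("pkg/io.py", "file"),
                   ("pkg/io.py::Reader", "class"),
                   ("pkg/io.py::Reader.read", "function"),
                   ("pkg/cli.py::main", "function")]:
    g.add_node(node, kind=kind)
g.add_edge("pkg/io.py", "pkg/io.py::Reader", kind="contains")
g.add_edge("pkg/io.py::Reader", "pkg/io.py::Reader.read", kind="contains")
g.add_edge("pkg/cli.py::main", "pkg/io.py::Reader.read", kind="invokes")

# A multi-hop question an agent tool could answer: what reaches Reader.read?
print(nx.ancestors(g, "pkg/io.py::Reader.read"))
```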

Authors:Xiwen Chen, Wenhui Zhu, Peijie Qiu, Hao Wang, Huayu Li, Haiyu Wu, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi
Title: Prompt-OT: An Optimal Transport Regularization Paradigm for Knowledge Preservation in Vision-Language Model Adaptation
Abstract:
Vision-language models (VLMs) such as CLIP demonstrate strong performance but struggle when adapted to downstream tasks. Prompt learning has emerged as an efficient and effective strategy to adapt VLMs while preserving their pre-trained knowledge. However, existing methods still lead to overfitting and degrade zero-shot generalization. To address this challenge, we propose an optimal transport (OT)-guided prompt learning framework that mitigates forgetting by preserving the structural consistency of feature distributions between pre-trained and fine-tuned models. Unlike conventional point-wise constraints, OT naturally captures cross-instance relationships and expands the feasible parameter space for prompt tuning, allowing a better trade-off between adaptation and generalization. Our approach enforces joint constraints on both vision and text representations, ensuring a holistic feature alignment. Extensive experiments on benchmark datasets demonstrate that our simple yet effective method can outperform existing prompt learning strategies in base-to-novel generalization, cross-dataset evaluation, and domain generalization without additional augmentation or ensemble techniques. The code is available at https://github.com/ChongQingNoSubway/Prompt-OT
中文摘要:本文提出了一种基于最优传输的提示学习框架,通过保持特征分布一致性来增强视觉语言模型的适应能力,无需额外技术即可在多项任务中实现卓越的泛化性能。
English Summary: This paper introduces an optimal transport-guided prompt learning framework to enhance vision-language model adaptation by preserving feature distribution consistency, achieving superior generalization across various tasks without extra techniques.
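Conceptually, the regularizer replaces point-wise penalties with an optimal-transport distance between the fine-tuned and frozen feature distributions. Below is a minimal entropic (Sinkhorn) sketch under that reading, not the released implementation.

```python
import torch

def sinkhorn_cost(cost, eps_scale=0.1, iters=100):
    """Entropic OT cost between uniform empirical measures; cost: (n, m)."""
    n, m = cost.shape
    eps = eps_scale * cost.mean()  # scale regularization to cost magnitude
    K = torch.exp(-cost / eps)
    u = torch.full((n,), 1.0 / n)
    v = torch.full((m,), 1.0 / m)
    for _ in range(iters):
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return (plan * cost).sum()

def ot_regularizer(tuned_feats, frozen_feats):
    # Penalize structural drift of tuned features from the frozen model's.
    return sinkhorn_cost(torch.cdist(tuned_feats, frozen_feats) ** 2)

print(ot_regularizer(torch.randn(8, 16), torch.randn(8, 16)))
```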

Authors:Zhiwen You, Yue Guo
Title: PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization
Abstract:
Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA)-based approaches, struggle with plain language summarization (PLS) due to the elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on PlainFact, a fine-grained, human-annotated dataset, for evaluating the factual consistency of both source-simplified and elaborately explained sentences. PlainQAFact first classifies sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact's effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents the first evaluation metric designed for PLS factual consistency evaluation, providing the community with both a robust benchmark and a practical tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact
Chinese: 针对大语言模型在医学领域产生幻觉内容对非专业受众的风险,我们提出了PlainQAFact评估指标,通过先分类句子类型再应用检索增强的问答评分方法,在评估医学通俗化摘要的事实一致性方面全面优于现有方法。
English: Large language models' medical hallucinations pose risks to lay audiences, prompting the development of PlainQAFact, a novel metric that outperforms existing methods in evaluating factual consistency in plain language medical summaries by first classifying sentence types and applying retrieval-augmented QA scoring.

Authors:Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen
Title: Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents
Abstract:
Previous studies have found that PLM-based retrieval models exhibit a preference for LLM-generated content, assigning higher relevance scores to these documents even when their semantic quality is comparable to human-written ones. This phenomenon, known as source bias, threatens the sustainable development of the information access ecosystem. However, the underlying causes of source bias remain unexplored. In this paper, we explain the process of information retrieval with a causal graph and discover that PLM-based retrievers learn perplexity features for relevance estimation, causing source bias by ranking the documents with low perplexity higher. Theoretical analysis further reveals that the phenomenon stems from the positive correlation between the gradients of the loss functions in language modeling task and retrieval task. Based on the analysis, a causal-inspired inference-time debiasing method is proposed, called Causal Diagnosis and Correction (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall estimated relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. Source codes are available at https://github.com/WhyDwelledOnAi/Perplexity-Trap.
Chinese: PLM检索模型因基于困惑度的排序而偏好LLM生成内容,为此提出的CDC方法通过诊断和修正偏差效应,在多个领域实现了有效的去偏处理。
English: PLM-based retrieval models exhibit source bias by favoring LLM-generated content due to perplexity-based ranking, prompting the development of the CDC method that effectively diagnoses and corrects this bias across multiple domains.
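The diagnose-then-correct step can be sketched with a simple linear stand-in for the bias estimator (the paper's causal formulation is more involved; this only conveys the shape of the correction):

```python
import numpy as np

def diagnose_bias(log_ppl, scores):
    """Fit the marginal effect of document log-perplexity on relevance."""
    slope, intercept = np.polyfit(np.asarray(log_ppl), np.asarray(scores), 1)
    return slope, intercept

def corrected_scores(scores, log_ppl, slope, intercept):
    # Subtract the diagnosed perplexity component; the constant shift
    # leaves ranking order unaffected.
    return np.asarray(scores) - (slope * np.asarray(log_ppl) + intercept)
```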

Authors:Viktor Moskvoretskii, Chris Biemann, Irina Nikishina
Title: Self-Taught Self-Correction for Small Language Models
Abstract:
Although large language models (LLMs) have achieved remarkable performance across various tasks, they remain prone to errors. A key challenge is enabling them to self-correct. While prior research has relied on external tools or large proprietary models, this work explores self-correction in small language models (SLMs) through iterative fine-tuning using solely self-generated data. We introduce the Self-Taught Self-Correction (STaSC) algorithm, which incorporates multiple algorithmic design choices. Experimental results on a question-answering task demonstrate that STaSC effectively learns self-correction, leading to significant performance improvements. Our analysis further provides insights into the mechanisms of self-correction and the impact of different design choices on learning dynamics and overall performance. To support future research, we release our user-friendly codebase and lightweight models.
Chinese: 本研究提出自我教学式修正(STaSC)算法,通过仅使用自生成数据进行迭代微调,使小型语言模型实现自我纠错,在问答任务上显著提升性能,并为自我修正机制及设计选择的影响提供了深入见解。
English: This study introduces the Self-Taught Self-Correction (STaSC) algorithm, enabling small language models to self-correct through iterative fine-tuning with self-generated data, significantly improving performance on question-answering tasks while providing insights into self-correction mechanisms.
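A high-level sketch of one self-correction round (the `generate`, `is_correct`, and `fine_tune` callables are assumed placeholders, not the released API): only self-generated corrections that turn an incorrect draft into a correct answer are kept for fine-tuning.

```python
from typing import Callable, List, Tuple

def stasc_round(
    generate: Callable[[str, str], str],      # (question, prior answer) -> text
    is_correct: Callable[[str, str], bool],   # (answer, question) -> verdict
    fine_tune: Callable[[List[Tuple[str, str, str]]], None],
    questions: List[str],
) -> None:
    """Collect self-generated corrections that fix an incorrect draft,
    then fine-tune on those trajectories; repeat across rounds."""
    data = []
    for q in questions:
        draft = generate(q, "")
        revision = generate(q, draft)
        if is_correct(revision, q) and not is_correct(draft, q):
            data.append((q, draft, revision))
    fine_tune(data)
```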

Authors:Zekun Li, Shinda Huang, Jiangtian Wang, Nathan Zhang, Antonis Antoniades, Wenyue Hua, Kaijie Zhu, Sirui Zeng, Chi Wang, William Yang Wang, Xifeng Yan
Title: SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints
Abstract:
As language agents increasingly automate critical tasks, their ability to follow domain-specific standard operating procedures (SOPs), policies, and constraints when taking actions and making tool calls becomes essential yet remains underexplored. To address this gap, we develop an automated evaluation pipeline SOPBench with: (1) executable environments containing 167 tools/functions across seven customer service domains with service-specific SOPs and rule-based verifiers, (2) an automated test generation framework producing over 900 verified test cases, and (3) an automated evaluation framework to rigorously assess agent adherence from multiple dimensions. Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions. The original code serves as oracle rule-based verifiers to assess compliance, reducing reliance on manual annotations and LLM-based evaluations. We evaluate 18 leading models, and results show the task is challenging even for top-tier models (like GPT-4o, Claude-3.7-Sonnet), with variances across domains. Reasoning models like o4-mini-high show superiority while other powerful models perform less effectively (pass rates of 30%-50%), and small models (7B, 8B) perform significantly worse. Additionally, language agents can be easily jailbroken to overlook SOPs and constraints. Code, data, and over 24k agent trajectories are released at https://github.com/Leezekun/SOPBench.
中文: SOPBench开发了一套自动化评估体系,通过七个客服领域的167种工具和900多个测试用例系统检验语言智能体对标准流程的遵循能力,研究发现即使顶尖模型也存在明显性能差异且易被突破规则限制。
English: SOPBench introduces an automated evaluation pipeline to rigorously test language agents' adherence to domain-specific procedures across seven service domains, revealing significant performance gaps even among top models while exposing vulnerabilities to constraint violations.

Authors:Jiale Wei, Xiang Ying, Tao Gao, Fangyi Bao, Felix Tao, Jingbo Shang
Title: AI-native Memory 2.0: Second Me
Abstract:
Human interaction with the external world fundamentally involves the exchange of personal memory, whether with other individuals, websites, applications, or, in the future, AI agents. A significant portion of this interaction is redundant, requiring users to repeatedly provide the same information across different contexts. Existing solutions, such as browser-stored credentials, autofill mechanisms, and unified authentication systems, have aimed to mitigate this redundancy by serving as intermediaries that store and retrieve commonly used user data. The advent of large language models (LLMs) presents an opportunity to redefine memory management through an AI-native paradigm: SECOND ME. SECOND ME acts as an intelligent, persistent memory offload system that retains, organizes, and dynamically utilizes user-specific knowledge. By serving as an intermediary in user interactions, it can autonomously generate context-aware responses, prefill required information, and facilitate seamless communication with external systems, significantly reducing cognitive load and interaction friction. Unlike traditional memory storage solutions, SECOND ME extends beyond static data retention by leveraging LLM-based memory parameterization. This enables structured organization, contextual reasoning, and adaptive knowledge retrieval, facilitating a more systematic and intelligent approach to memory management. As AI-driven personal agents like SECOND ME become increasingly integrated into digital ecosystems, SECOND ME further represents a critical step toward augmenting human-world interaction with persistent, contextually aware, and self-optimizing memory systems. We have open-sourced the fully localizable deployment system at GitHub: https://github.com/Mindverse/Second-Me.
中文摘要:SECOND ME系统利用大语言模型实现AI原生的记忆管理,通过智能存储、组织和动态运用个人信息,有效减少人类与外部世界交互中的重复操作,提升交互效率。
English Summary: The SECOND ME system leverages large language models to create an AI-native memory management solution that reduces redundancy in human-world interactions by intelligently storing, organizing, and dynamically utilizing personal information across various contexts.

Authors:Ali Veisi, Hamidreza Amirzadeh, Amir Mansourian
Title: Context-aware Biases for Length Extrapolation
Abstract:
Transformers often struggle to generalize to longer sequences than those seen during training, a limitation known as length extrapolation. Most existing Relative Positional Encoding (RPE) methods attempt to address this by introducing either fixed linear biases or globally learned biases, which lack the capacity to adapt to different input contexts. In this work, we propose an additive RPE, Context-Aware Biases for Length Extrapolation (CABLE), a method that learns token-specific, context-aware biases for each attention head in transformers. By dynamically adjusting positional biases based on the input sequence, CABLE overcomes the rigidity of fixed RPEs. When evaluated on sequences longer than those seen during training, GPT-2 Medium (334M parameters) with CABLE achieves lower perplexity than counterparts using other widely adopted positional encoding methods. Additionally, by applying CABLE to the BERT base model we improved performance in long-context retrieval tasks. Our method significantly enhances the extrapolation performance of existing RPE methods tested on the FineWeb-Edu-10B and WikiText-103 datasets. Our code is available at: https://github.com/AlgonetLabs/Cable.
中文: 本文提出CABLE方法,一种上下文感知的相对位置编码技术,能根据输入序列动态调整位置偏置,显著提升了Transformer模型在训练未见过的长序列上的泛化性能。
English: The paper introduces CABLE, a context-aware relative positional encoding method that dynamically adjusts biases based on input sequences, significantly improving transformer models' performance on longer sequences than seen during training.
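One illustrative reading of a context-aware additive bias (an assumption, not the released code): each head predicts a bias per token from its hidden state and adds it along the key axis of the attention logits.

```python
import torch
import torch.nn as nn

class ContextAwareBias(nn.Module):
    """Predict one additive bias per token and head from the hidden states."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.proj = nn.Linear(d_model, n_heads)

    def forward(self, attn_logits, hidden):
        # attn_logits: (B, H, T, T); hidden: (B, T, D)
        bias = self.proj(hidden)                   # (B, T, H), context-aware
        bias = bias.permute(0, 2, 1).unsqueeze(2)  # (B, H, 1, T)
        return attn_logits + bias                  # broadcast over query positions

B, H, T, D = 2, 4, 16, 32
logits, hidden = torch.randn(B, H, T, T), torch.randn(B, T, D)
print(ContextAwareBias(D, H)(logits, hidden).shape)  # torch.Size([2, 4, 16, 16])
```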

Authors:Ying Fu Lim, Jiawen Zhu, Guansong Pang
Title: Adapting Large Language Models for Parameter-Efficient Log Anomaly Detection
Abstract:
Log Anomaly Detection (LAD) seeks to identify atypical patterns in log data that are crucial to assessing the security and condition of systems. Although Large Language Models (LLMs) have shown tremendous success in various fields, the use of LLMs in enabling the detection of log anomalies is largely unexplored. This work aims to fill this gap. Due to the prohibitive costs involved in fully fine-tuning LLMs, we explore the use of parameter-efficient fine-tuning techniques (PEFTs) for adapting LLMs to LAD. To have an in-depth exploration of the potential of LLM-driven LAD, we present a comprehensive investigation of leveraging two of the most popular PEFTs -- Low-Rank Adaptation (LoRA) and Representation Fine-tuning (ReFT) -- to tap into three prominent LLMs of varying size, including RoBERTa, GPT-2, and Llama-3, for parameter-efficient LAD. Comprehensive experiments on four public log datasets are performed to reveal important insights into effective LLM-driven LAD in several key perspectives, including the efficacy of these PEFT-based LLM-driven LAD methods, their stability, sample efficiency, robustness w.r.t. unstable logs, and cross-dataset generalization. Code is available at https://github.com/mala-lab/LogADReft.
中文: 本研究探索使用LoRA和ReFT等参数高效微调技术,将大语言模型适配于日志异常检测任务,并在多个数据集上从效能、稳定性等关键维度评估其性能。
English: This study explores parameter-efficient fine-tuning techniques like LoRA and ReFT to adapt large language models for log anomaly detection, evaluating their effectiveness across multiple datasets and key performance aspects.
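A minimal sketch of the LoRA-style setup the paper investigates, using the Hugging Face `peft` API; the base model and target modules below are illustrative choices, not the paper's exact configuration.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Binary classifier over log sequences: normal vs. anomalous.
base = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)
config = LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16,
                    target_modules=["query", "value"], lora_dropout=0.1)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```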

Authors:Samuel Cahyawijaya, Holy Lovenia, Joel Ruben Antony Moniz, Tack Hwa Wong, Mohammad Rifqi Farhansyah, Thant Thiri Maung, Frederikus Hudi, David Anugraha, Muhammad Ravi Shulthan Habibi, Muhammad Reza Qorib, Amit Agarwal, Joseph Marvin Imperial, Hitesh Laxmichand Patel, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Genta Indra Winata, Rian Adam Rajagede, Carlos Rafael Catalan, Mohamed Fazli Imam, Priyaranjan Pattnayak, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Adisai Na-Thalang, Patricia Nicole Monderin, Yueqi Song, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo' Sapan, Taki Hasan Rafi, Bin Wang, Supryadi, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Karissa Vincentio, Takdanai Kreangphet, Phakphum Artkaew, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, William Nixon, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Ayushman Singh, Evan, Jauza Akbar Krito, Michael Anugraha, Fenal Ashokbhai Ilasariya, Haochen Li, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Can Udomcharoenchaikit, Fadil Risdian Ansori, Mahardika Krisna Ihsani, Giang Nguyen, Anab Maulana Barik, Dan John Velasco, Rifo Ahmad Genadi, Saptarshi Saha, Chengwei Wei, Isaiah Flores, Kenneth Ko Han Chen, Anjela Gail Santos, Wan Shen Lim, Kaung Si Phyo, Tim Santos, Meisyarah Dwiastuti, Jiayun Luo, Jan Christian Blaise Cruz, Ming Shan Hee, Ikhlasul Akmal Hanif, M. Alif Al Hakim, Muhammad Rizky Sya'ban, Kun Kerdthaisong, Lester James V. Miranda, Fajri Koto, Tirana Noor Fatyanosa, Alham Fikri Aji, Jostin Jerico Rosal, Jun Kevin, Robert Wijaya, Onno P. Kampman, Ruochen Zhang, Börje F. Karlsson, Peerat Limkonchotiwat
Title: Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia
Abstract:
Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further in the exploration of the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M SEA culturally-relevant images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.
中文摘要:SEA-VL是一个开源项目,旨在通过众包和自动化方法收集东南亚文化相关数据,以弥补该地区在视觉语言研究中的代表性不足,最终汇集了128万张图像,推动构建更具包容性的人工智能系统。
English Summary: SEA-VL is an open-source initiative addressing the underrepresentation of Southeast Asian cultures in vision-language research by creating culturally relevant datasets through crowdsourcing and automated methods, ultimately gathering 1.28M images to foster more inclusive AI systems.

Authors:Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, Liang Lin
Title: Cross-modal Causal Relation Alignment for Video Question Grounding
Abstract:
Video question grounding (VideoQG) requires models to answer the questions and simultaneously infer the relevant video segments to support the answers. However, existing VideoQG methods usually suffer from spurious cross-modal correlations, leading to a failure to identify the dominant visual scenes that align with the intended question. Moreover, vision-language models exhibit unfaithful generalization performance and lack robustness on challenging downstream tasks such as VideoQG. In this work, we propose a novel VideoQG framework named Cross-modal Causal Relation Alignment (CRA), to eliminate spurious correlations and improve the causal consistency between question-answering and video temporal grounding. Our CRA involves three essential components: i) Gaussian Smoothing Grounding (GSG) module for estimating the time interval via cross-modal attention, which is de-noised by an adaptive Gaussian filter, ii) Cross-Modal Alignment (CMA) enhances the performance of weakly supervised VideoQG by leveraging bidirectional contrastive learning between estimated video segments and QA features, iii) Explicit Causal Intervention (ECI) module for multimodal deconfounding, which involves front-door intervention for vision and back-door intervention for language. Extensive experiments on two VideoQG datasets demonstrate the superiority of our CRA in discovering visually grounded content and achieving robust question reasoning. Codes are available at https://github.com/WissingChen/CRA-GQA.
中文摘要:本文提出的跨模态因果关联对齐(CRA)框架通过高斯平滑定位、跨模态对齐和显式因果干预三大模块,有效消除视频问答任务中的伪相关性,显著提升了时序定位精度与问答推理的鲁棒性。
English Summary: The proposed Cross-modal Causal Relation Alignment (CRA) framework addresses spurious correlations in Video Question Grounding by integrating Gaussian smoothing, cross-modal alignment, and explicit causal intervention to improve temporal localization and reasoning robustness.

Authors:Jen-tse Huang, Jiantong Qin, Jianping Zhang, Youliang Yuan, Wenxuan Wang, Jieyu Zhao
Title: VisBias: Measuring Explicit and Implicit Social Biases in Vision Language Models
Abstract:
This research investigates both explicit and implicit social biases exhibited by Vision-Language Models (VLMs). The key distinction between these bias types lies in the level of awareness: explicit bias refers to conscious, intentional biases, while implicit bias operates subconsciously. To analyze explicit bias, we directly pose questions to VLMs related to gender and racial differences: (1) Multiple-choice questions based on a given image (e.g., "What is the education level of the person in the image?") (2) Yes-No comparisons using two images (e.g., "Is the person in the first image more educated than the person in the second image?") For implicit bias, we design tasks where VLMs assist users but reveal biases through their responses: (1) Image description tasks: Models are asked to describe individuals in images, and we analyze disparities in textual cues across demographic groups. (2) Form completion tasks: Models draft a personal information collection form with 20 attributes, and we examine correlations among selected attributes for potential biases. We evaluate Gemini-1.5, GPT-4V, GPT-4o, LLaMA-3.2-Vision and LLaVA-v1.6. Our code and data are publicly available at https://github.com/uscnlp-lime/VisBias.
中文: 本研究通过直接提问和间接任务测试视觉语言模型在性别和种族方面的显性与隐性社会偏见,评估了包括Gemini-1.5和GPT-4V在内的多个模型。
English: This study examines explicit and implicit social biases in Vision-Language Models by testing them with direct questions and indirect tasks related to gender and race, evaluating models like Gemini-1.5 and GPT-4V.

Authors:Yash Akhauri, Ahmed F AbouElhamayed, Yifei Gao, Chi-Chih Chang, Nilesh Jain, Mohamed S. Abdelfattah
Title: TokenButler: Token Importance is Predictable
Abstract:
Large Language Models (LLMs) rely on the Key-Value (KV) Cache to store token history, enabling efficient decoding of tokens. As the KV-Cache grows, it becomes a major memory and computation bottleneck. However, there is an opportunity to alleviate this bottleneck, especially because prior research has shown that only a small subset of tokens contribute meaningfully to each decoding step. A key challenge in finding these critical tokens is that they are dynamic and heavily query-dependent. Existing methods either risk quality by evicting tokens permanently, or retain the full KV-Cache but rely on retrieving chunks (pages) of tokens at generation, failing at dense, context-rich tasks. Additionally, many existing KV-Cache sparsity methods rely on inaccurate proxies for token importance. To address these limitations, we introduce TokenButler, a high-granularity, query-aware predictor that learns to identify these critical tokens. By training a light-weight predictor with less than 1.2% parameter overhead, TokenButler prioritizes tokens based on their contextual, predicted importance. This improves perplexity & downstream accuracy by over 8% relative to SoTA methods for estimating token importance. We evaluate TokenButler on a novel synthetic small-context co-referential retrieval task, demonstrating near-oracle accuracy. Code, models and benchmarks: https://github.com/abdelfattah-lab/TokenButler
中文: TokenButler 提出了一种轻量级、查询感知的预测器,能动态识别KV缓存中的关键令牌,相比现有方法,在效率和准确率上提升了超过8%。
English: TokenButler introduces a lightweight, query-aware predictor that dynamically identifies critical tokens in the KV-Cache, significantly improving efficiency and accuracy over existing methods by over 8%.
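Once a predictor has scored the cache, the selection step reduces to a per-head top-k gather; here is a schematic sketch with assumed shapes (the real predictor is query-conditioned and trained, which is omitted here):

```python
import torch

def select_critical_tokens(importance_logits, keys, values, k):
    """importance_logits: (B, H, T) scores from a lightweight predictor;
    keys/values: (B, H, T, D). Returns the top-k cached tokens per head."""
    idx = importance_logits.topk(k, dim=-1).indices           # (B, H, k)
    gather = idx.unsqueeze(-1).expand(-1, -1, -1, keys.size(-1))
    return keys.gather(2, gather), values.gather(2, gather)

B, H, T, D = 1, 2, 128, 64
scores = torch.randn(B, H, T)
keys, values = torch.randn(B, H, T, D), torch.randn(B, H, T, D)
k_sel, v_sel = select_critical_tokens(scores, keys, values, k=16)
print(k_sel.shape)  # torch.Size([1, 2, 16, 64])
```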

Authors:Xiangru Tang, Daniel Shao, Jiwoong Sohn, Jiapeng Chen, Jiayi Zhang, Jinyu Xiang, Fang Wu, Yilun Zhao, Chenglin Wu, Wenqi Shi, Arman Cohan, Mark Gerstein
Title: MedAgentsBench: Benchmarking Thinking Models and Agent Frameworks for Complex Medical Reasoning
Abstract:
Large Language Models (LLMs) have shown impressive performance on existing medical question-answering benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsBench, a benchmark that focuses on challenging medical questions requiring multi-step clinical reasoning, diagnosis formulation, and treatment planning-scenarios where current models still struggle despite their strong performance on standard tests. Drawing from seven established medical datasets, our benchmark addresses three key limitations in existing evaluations: (1) the prevalence of straightforward questions where even base models achieve high performance, (2) inconsistent sampling and evaluation protocols across studies, and (3) lack of systematic analysis of the interplay between performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, exhibit exceptional performance in complex medical reasoning tasks. Additionally, advanced search-based agent methods offer promising performance-to-cost ratios compared to traditional approaches. Our analysis reveals substantial performance gaps between model families on complex questions and identifies optimal model selections for different computational constraints. Our benchmark and evaluation framework are publicly available at https://github.com/gersteinlab/medagents-benchmark.
中文: MedAgentsBench是一个针对复杂医学推理任务的新基准,旨在解决现有评估的局限性,并揭示先进模型在需要多步骤临床推理的难题上的性能差异。
English: MedAgentsBench is a new benchmark that evaluates LLMs on complex medical reasoning tasks where current models still struggle, addressing limitations in existing evaluations and revealing performance gaps among advanced models.

Authors:Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, Li Yuan
Title: WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Abstract:
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose $\textbf{WISE}$, the first benchmark specifically designed for $\textbf{W}$orld Knowledge-$\textbf{I}$nformed $\textbf{S}$emantic $\textbf{E}$valuation. WISE moves beyond simple word-pixel mapping by challenging models with 1000 meticulously crafted prompts across 25 sub-domains in cultural common sense, spatio-temporal reasoning, and natural science. To overcome the limitations of the traditional CLIP metric, we introduce $\textbf{WiScore}$, a novel quantitative metric for assessing knowledge-image alignment. Through comprehensive testing of 20 models (10 dedicated T2I models and 10 unified multimodal models) using 1,000 structured prompts spanning 25 subdomains, our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models. Code and data are available at https://github.com/PKU-YuanGroup/WISE.
中文: WISE基准通过引入涵盖25个领域的结构化提示和创新的WiScore评估指标,解决了文本到图像模型在复杂语义理解与世界知识整合方面的评估空白,揭示了现有模型在知识应用上的显著不足。
English: The WISE benchmark addresses the gap in evaluating complex semantic understanding and world knowledge integration in text-to-image models by introducing structured prompts across 25 domains and a novel WiScore metric, revealing significant limitations in current models' knowledge application.

Authors:Junhao Zhang, Richong Zhang, Fanshuang Kong, Ziyang Miao, Yanhan Ye, Yaowei Zheng
Title: Lost-in-the-Middle in Long-Text Generation: Synthetic Dataset, Evaluation Framework, and Mitigation
Abstract:
Existing long-text generation methods primarily concentrate on producing lengthy texts from short inputs, neglecting the long-input and long-output tasks. Such tasks have numerous practical applications while lacking available benchmarks. Moreover, as the input grows in length, existing methods inevitably encounter the "lost-in-the-middle" phenomenon. In this paper, we first introduce a Long Input and Output Benchmark (LongInOutBench), including a synthetic dataset and a comprehensive evaluation framework, addressing the challenge of the missing benchmark. We then develop the Retrieval-Augmented Long-Text Writer (RAL-Writer), which retrieves and restates important yet overlooked content, mitigating the "lost-in-the-middle" issue by constructing explicit prompts. We finally employ the proposed LongInOutBench to evaluate our RAL-Writer against comparable baselines, and the results demonstrate the effectiveness of our approach. Our code has been released at https://github.com/OnlyAR/RAL-Writer.
中文: 本文提出了长输入输出基准LongInOutBench,并开发了RAL-Writer方法,通过检索和重述被忽略内容来解决"中间丢失"问题,评估结果验证了该方法的有效性。
English: This paper introduces LongInOutBench, a benchmark for long-input and long-output text generation tasks, and proposes RAL-Writer, a method that retrieves and restates overlooked content to address the "lost-in-the-middle" problem, demonstrating its effectiveness through evaluation.

Authors:Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, Shaohui Lin
Title: Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
Abstract:
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of $\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released in: https://github.com/Osilly/Vision-R1 .
中文: Vision-R1通过构建无需人工标注的20万规模多模态思维链数据集,并采用渐进式训练策略,有效提升了多模态数学推理能力,在多个基准测试中表现优异。
English: Vision-R1 is a multimodal reasoning model that enhances reasoning capabilities by creating a high-quality 200K multimodal CoT dataset without human annotation and employing progressive training strategies, achieving significant improvements on math reasoning benchmarks.

Authors:Ming Zhang, Yuhui Wang, Yujiong Shen, Tingyi Yang, Changhao Jiang, Yilong Wu, Shihan Dou, Qinhao Chen, Zhiheng Xi, Zhihao Zhang, Yi Dong, Zhen Wang, Zhihui Fei, Mingyang Wan, Tao Liang, Guojun Ma, Qi Zhang, Tao Gui, Xuanjing Huang
Title: PFDial: A Structured Dialogue Instruction Fine-tuning Method Based on UML Flowcharts
Abstract:
Process-driven dialogue systems, which operate under strict predefined process constraints, are essential in customer service and equipment maintenance scenarios. Although Large Language Models (LLMs) have shown remarkable progress in dialogue and reasoning, they still struggle to solve these strictly constrained dialogue tasks. To address this challenge, we construct the Process Flow Dialogue (PFDial) dataset, which contains 12,705 high-quality Chinese dialogue instructions derived from 440 flowcharts containing 5,055 process nodes. Based on the PlantUML specification, each UML flowchart is converted into atomic dialogue units, i.e., structured five-tuples. Experimental results demonstrate that both a 7B model trained with merely 800 samples and a 0.5B model trained on the full dataset can surpass 90% accuracy. Additionally, the 8B model can surpass GPT-4o by up to 43.88%, with an average improvement of 11.00%. We further evaluate models' performance on challenging backward transitions in process flows and conduct an in-depth analysis of various dataset formats to reveal their impact on model performance in handling decision and sequential branches. The data is released at https://github.com/KongLongGeFDU/PFDial.
中文: PFDial数据集基于440个UML流程图构建了12,705条中文对话指令,实验表明仅用800样本训练的7B模型和全量训练的0.5B模型均能突破90%准确率,在流程驱动对话任务中最高可超越GPT-4o达43.88%。
English: The PFDial dataset, comprising 12,705 Chinese dialogue instructions derived from 440 UML flowcharts, enables small models like 7B and 0.5B to achieve over 90% accuracy in process-driven dialogue tasks, even surpassing GPT-4o by up to 43.88%.

Authors:Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Mengdi Zhang, Jian Shao, Yueting Zhuang
Title: InftyThink: Breaking the Length Limits of Long-Context Reasoning in Large Language Models
Abstract:
Advanced reasoning in large language models has achieved remarkable performance on challenging tasks, but the prevailing long-context reasoning paradigm faces critical limitations: quadratic computational scaling with sequence length, reasoning constrained by maximum context boundaries, and performance degradation beyond pre-training context windows. Existing approaches primarily compress reasoning chains without addressing the fundamental scaling problem. To overcome these challenges, we introduce InftyThink, a paradigm that transforms monolithic reasoning into an iterative process with intermediate summarization. By interleaving short reasoning segments with concise progress summaries, our approach enables unbounded reasoning depth while maintaining bounded computational costs. This creates a characteristic sawtooth memory pattern that significantly reduces computational complexity compared to traditional approaches. Furthermore, we develop a methodology for reconstructing long-context reasoning datasets into our iterative format, transforming OpenR1-Math into 333K training instances. Experiments across multiple model architectures demonstrate that our approach reduces computational costs while improving performance, with Qwen2.5-Math-7B showing 3-13% improvements across MATH500, AIME24, and GPQA_diamond benchmarks. Our work challenges the assumed trade-off between reasoning depth and computational efficiency, providing a more scalable approach to complex reasoning without architectural modifications.
中文: InftyThink通过中间摘要的迭代推理范式,在保持有限计算成本的同时实现无限推理深度,无需架构修改即可在多个基准测试中提升性能表现。
English: InftyThink introduces an iterative reasoning paradigm with intermediate summarization that enables unbounded reasoning depth while maintaining bounded computational costs, achieving performance improvements across multiple benchmarks without architectural modifications.
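The iterative paradigm is easy to sketch as a reason-then-summarize loop (the `llm` callable and prompt wording are assumptions, not the paper's prompts); each call sees only the question plus a bounded summary, producing the sawtooth memory pattern:

```python
from typing import Callable, Optional

def inftythink(llm: Callable[[str], str], question: str,
               max_rounds: int = 8) -> Optional[str]:
    summary = ""
    for _ in range(max_rounds):
        segment = llm(
            f"Question: {question}\nProgress so far: {summary}\n"
            "Continue the reasoning briefly; end with 'ANSWER:' when done."
        )
        if "ANSWER:" in segment:
            return segment.split("ANSWER:", 1)[1].strip()
        # Replace the growing trace with a concise summary: bounded context.
        summary = llm(
            f"Condense this reasoning progress into a short summary:\n"
            f"{summary}\n{segment}"
        )
    return None
```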

Authors:Jinmyeong An, Sangwon Ryu, Heejin Do, Yunsu Kim, Jungseul Ok, Gary Geunbae Lee
Title: Revisiting Early Detection of Sexual Predators via Turn-level Optimization
Abstract:
Online grooming is a severe social threat where sexual predators gradually entrap child victims with subtle and gradual manipulation. Therefore, timely intervention for online grooming is critical for proactive protection. However, previous methods fail to determine the optimal intervention points (i.e., they jump to conclusions) as they rely on chat-level risk labels, which provide only weak supervision of risky utterances. For timely detection, we propose speed control reinforcement learning (SCoRL) (the code and supplementary materials are available at https://github.com/jinmyeongAN/SCoRL), incorporating a practical strategy derived from luring communication theory (LCT). To capture the predator's turn-level entrapment, we use a turn-level risk label based on the LCT. Then, we design a novel speed control reward function that balances the trade-off between speed and accuracy based on the turn-level risk label; thus, SCoRL can identify the optimal intervention moment. In addition, we introduce a turn-level metric for precise evaluation, identifying limitations in previously used chat-level metrics. Experimental results show that SCoRL effectively preempted online grooming, offering a more proactive and timely solution. Further analysis reveals that our method enhances performance while intuitively identifying optimal early intervention points.
中文摘要:本研究提出的SCoRL方法通过引入话轮级风险评估和速度控制奖励机制,能主动识别网络诱骗的最佳干预时机,有效解决了以往基于聊天级标签方法导致的监管滞后问题。
English Summary: The proposed SCoRL method uses reinforcement learning with turn-level risk assessment to proactively identify optimal intervention moments in online grooming, overcoming previous methods' reliance on chat-level labels that caused delayed detection.
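The speed-accuracy trade-off in the reward can be sketched as follows (the coefficients and functional form are illustrative assumptions, not the paper's exact reward):

```python
def speed_control_reward(correct: bool, turn: int, total_turns: int,
                         lam: float = 0.5) -> float:
    """Reward a correct intervention more the earlier it fires; penalize
    incorrect interventions regardless of timing."""
    accuracy = 1.0 if correct else -1.0
    earliness = 1.0 - turn / max(total_turns, 1)
    return accuracy + (lam * earliness if correct else 0.0)
```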

Authors:Yingfeng Luo, Tong Zheng, Yongyu Mu, Bei Li, Qinghong Zhang, Yongqi Gao, Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, Jingbo Zhu
Title: Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation
Abstract:
The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on the WMT and our datasets show that results using our method match or surpass a range of baselines in terms of translation quality, but achieve $2.4 \sim 6.5 \times$ inference speedups and a $75\%$ reduction in the memory footprint of the KV cache. It also demonstrates strong generalization across a variety of translation-related tasks.
中文摘要:本研究将大型语言模型与神经机器翻译相结合,开发出高效通用的翻译系统,在保持跨任务强泛化能力的同时,实现了更优的翻译质量、更快的推理速度和更低的内存占用。
English Summary: This study integrates large language models with neural machine translation to create efficient and universally applicable translation systems, achieving superior translation quality, faster inference speeds, and reduced memory usage while maintaining strong generalization across tasks.

Authors:Tatsuro Inaba, Kentaro Inui, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi
Title: How LLMs Learn: Tracing Internal Representations with Sparse Autoencoders
Abstract:
Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge. However, the internal mechanisms underlying the development of these capabilities remain poorly understood. To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process. Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the interpretative results across these stages. Our findings suggest that LLMs initially acquire language-specific knowledge independently, followed by cross-linguistic correspondences. Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating the development of more conceptual understanding.
中文: 大语言模型在训练过程中先习得各语言特有知识,随后建立跨语言关联,并从词汇层面学习逐步过渡到掌握更高层次的抽象概念。
English: Large Language Models initially develop language-specific knowledge and then learn cross-linguistic patterns, progressing from token-level information to higher-level abstract concepts during training.

Authors:Lexin Zhou, Lorenzo Pacchiardi, Fernando Martínez-Plumed, Katherine M. Collins, Yael Moros-Daval, Seraphina Zhang, Qinlin Zhao, Yitian Huang, Luning Sun, Jonathan E. Prunty, Zongqian Li, Pablo Sánchez-García, Kexin Jiang Chen, Pablo A. M. Casares, Jiyun Zu, John Burden, Behzad Mehrbakhsh, David Stillwell, Manuel Cebrian, Jindong Wang, Peter Henderson, Sherry Tongshuang Wu, Patrick C. Kyllonen, Lucy Cheke, Xing Xie, José Hernández-Orallo
Title: General Scales Unlock AI Evaluation with Explanatory and Predictive Power
Abstract:
Ensuring safe and effective use of AI requires understanding and anticipating its performance on novel tasks, from advanced scientific challenges to transformed workplace activities. So far, benchmarking has guided progress in AI, but it has offered limited explanatory and predictive power for general-purpose AI systems, given the low transferability across diverse tasks. In this paper, we introduce general scales for AI evaluation that can explain what common AI benchmarks really measure, extract ability profiles of AI systems, and predict their performance for new task instances, in- and out-of-distribution. Our fully-automated methodology builds on 18 newly-crafted rubrics that place instance demands on general scales that do not saturate. Illustrated on 15 large language models and 63 tasks, the methodology unlocks high explanatory power from inspecting the demand and ability profiles, bringing insights on the sensitivity and specificity exhibited by different benchmarks, and how knowledge, metacognition and reasoning are affected by model size, chain-of-thought and distillation. Surprisingly, high predictive power at the instance level becomes possible using these demand levels, providing superior estimates over black-box baseline predictors based on embeddings or finetuning, especially in out-of-distribution settings (new tasks and new benchmarks). The scales, rubrics, battery, techniques and results presented here represent a major step for AI evaluation, underpinning the reliable deployment of AI in the years ahead. (Collaborative platform: https://kinds-of-intelligence-cfi.github.io/ADELE.)
中文: 本文提出了通用的人工智能评估量表,通过分析需求和能力概况,增强了评估的解释和预测能力,利用自动化量规和优化的性能预测,特别是在分布外场景中,确保人工智能的可靠部署。
English: This paper introduces general scales for AI evaluation that enhance explanatory and predictive power by analyzing demand and ability profiles, enabling reliable deployment through automated rubrics and improved performance forecasting, especially in out-of-distribution scenarios.

Authors:Yixin Wu, Feiran Zhang, Tianyuan Shi, Ruicheng Yin, Zhenghua Wang, Zhenliang Gan, Xiaohua Wang, Changze Lv, Xiaoqing Zheng, Xuanjing Huang
Title: Explainable Synthetic Image Detection through Diffusion Timestep Ensembling
Abstract:
Recent advances in diffusion models have enabled the creation of deceptively real images, posing significant security risks when misused. In this study, we empirically show that different timesteps of DDIM inversion reveal varying subtle distinctions between synthetic and real images that are extractable for detection, in forms such as high-frequency discrepancies in the Fourier power spectrum and inter-pixel variance distributions. Based on these observations, we propose a novel synthetic image detection method that directly utilizes features of intermediately noised images by training an ensemble on multiple noised timesteps, circumventing conventional reconstruction-based strategies. To enhance human comprehension, we introduce a metric-grounded explanation generation and refinement module to identify and explain AI-generated flaws. Additionally, we construct the GenHard and GenExplain benchmarks to provide detection samples of greater difficulty and high-quality rationales for fake images. Extensive experiments show that our method achieves state-of-the-art performance with 98.91% and 95.89% detection accuracy on regular and challenging samples respectively, and demonstrates generalizability and robustness. Our code and datasets are available at https://github.com/Shadowlized/ESIDE.
中文: 本研究提出了一种新颖的合成图像检测方法,通过利用多时间步中间噪声图像特征,在常规样本和挑战性基准上分别实现了98.91%和95.89%的最优检测准确率,同时提供了可解释的AI生成缺陷识别功能。
English: This study introduces a novel synthetic image detection method that leverages intermediate noised image features across multiple timesteps, achieving state-of-the-art detection accuracy of 98.91% on regular samples and 95.89% on challenging benchmarks while providing explainable AI-generated flaw identification.
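A minimal sketch of the timestep-ensembling idea described above, with plain forward-diffusion noising standing in for DDIM inversion and toy linear heads in place of the paper's feature extractors; the timesteps and cosine schedule are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TimestepEnsembleDetector(nn.Module):
    """Detect synthetic images from features of intermediately noised
    inputs: one lightweight head per timestep, predictions averaged."""

    def __init__(self, timesteps=(100, 300, 500)):
        super().__init__()
        self.timesteps = timesteps
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Flatten(), nn.LazyLinear(2)) for _ in timesteps
        )

    @staticmethod
    def add_noise(x, t, T=1000):
        # Cosine schedule: alpha_bar shrinks as t grows, mixing in more noise.
        alpha_bar = torch.cos(torch.tensor(t / T) * torch.pi / 2) ** 2
        return alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * torch.randn_like(x)

    def forward(self, x):  # x: (batch, C, H, W)
        votes = [head(self.add_noise(x, t))
                 for head, t in zip(self.heads, self.timesteps)]
        return torch.stack(votes).mean(dim=0)  # averaged real-vs-fake logits

logits = TimestepEnsembleDetector()(torch.randn(4, 3, 32, 32))  # (4, 2)
```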

Authors:Hoang-Thang Ta, Anh Tran
Title: AF-KAN: Activation Function-Based Kolmogorov-Arnold Networks for Efficient Representation Learning
Abstract:
Kolmogorov-Arnold Networks (KANs) have inspired numerous works exploring their applications across a wide range of scientific problems, with the potential to replace Multilayer Perceptrons (MLPs). While many KANs are designed using basis and polynomial functions, such as B-splines, ReLU-KAN utilizes a combination of ReLU functions to mimic the structure of B-splines and take advantage of ReLU's speed. However, ReLU-KAN is not built for multiple inputs, and its limitations stem from ReLU's handling of negative values, which can restrict feature extraction. To address these issues, we introduce Activation Function-Based Kolmogorov-Arnold Networks (AF-KAN), expanding ReLU-KAN with various activations and their function combinations. This novel KAN also incorporates parameter reduction methods, primarily attention mechanisms and data normalization, to enhance performance on image classification datasets. We explore different activation functions, function combinations, grid sizes, and spline orders to validate the effectiveness of AF-KAN and determine its optimal configuration. In the experiments, AF-KAN significantly outperforms MLP, ReLU-KAN, and other KANs with the same parameter count. It also remains competitive even when using 6 to 10 times fewer parameters, while maintaining the same network structure. However, AF-KAN requires a longer training time and consumes more FLOPs. The repository for this work is available at https://github.com/hoangthangta/All-KAN.
Chinese: AF-KAN提出了一种新型的科尔莫戈罗夫-阿诺德网络,采用多种激活函数和参数精简方法,在图像分类任务中以更少参数显著优于MLP和其他KAN,但需要更长的训练时间。
English: AF-KAN introduces a novel Kolmogorov-Arnold Network that utilizes various activation functions and parameter reduction methods, significantly outperforming MLPs and other KANs in image classification while requiring fewer parameters but longer training times.
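As a rough illustration of swapping spline bases for activation-function combinations, the sketch below gives each KAN edge a learnable mixture of shifted SiLU responses; the grid, the SiLU choice, and the combination rule are assumptions rather than AF-KAN's exact formulation.

```python
import torch
import torch.nn as nn

class AFKANLayer(nn.Module):
    """KAN-style layer where each input-output edge applies a learnable
    combination of shifted activations instead of a B-spline basis."""

    def __init__(self, in_dim, out_dim, grid=8):
        super().__init__()
        self.shifts = nn.Parameter(torch.linspace(-2.0, 2.0, grid))  # basis centers
        self.coef = nn.Parameter(0.1 * torch.randn(out_dim, in_dim, grid))

    def forward(self, x):  # x: (batch, in_dim)
        # (batch, in_dim, grid): one activation response per shift.
        basis = torch.nn.functional.silu(x.unsqueeze(-1) - self.shifts)
        return torch.einsum("big,oig->bo", basis, self.coef)

y = AFKANLayer(784, 10)(torch.randn(32, 784))  # (32, 10)
```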

Authors:Xiang Lan, Feng Wu, Kai He, Qinghao Zhao, Shenda Hong, Mengling Feng
Title: GEM: Empowering MLLM for Grounded ECG Understanding with Time Series and Images
Abstract:
While recent multimodal large language models (MLLMs) have advanced automated ECG interpretation, they still face two key limitations: (1) insufficient multimodal synergy between time series signals and visual ECG representations, and (2) limited explainability in linking diagnoses to granular waveform evidence. We introduce GEM, the first MLLM unifying ECG time series, 12-lead ECG images and text for grounded and clinician-aligned ECG interpretation. GEM enables feature-grounded analysis, evidence-driven reasoning, and a clinician-like diagnostic process through three core innovations: a dual-encoder framework extracting complementary time series and image features, cross-modal alignment for effective multimodal understanding, and knowledge-guided instruction generation for generating high-granularity grounding data (ECG-Grounding) linking diagnoses to measurable parameters (e.g., QRS/PR intervals). Additionally, we propose the Grounded ECG Understanding task, a clinically motivated benchmark designed to comprehensively assess the MLLM's capability in grounded ECG understanding. Experimental results on both existing and our proposed benchmarks show GEM significantly improves predictive performance (CSN 7.4%↑), explainability (22.7%↑), and grounding (24.8%↑), making it more suitable for real-world clinical applications. GitHub repository: https://github.com/lanxiang1017/GEM.git
中文: GEM作为首个融合心电时间序列、12导联图像与文本的多模态大模型,通过跨模态特征对齐和知识引导实现了可解释的心电分析,显著提升了诊断性能和临床适用性。
English: GEM is a novel multimodal large language model that integrates ECG time series, images, and text to enhance diagnostic accuracy, explainability, and clinical alignment through cross-modal feature extraction and evidence-driven reasoning.

Authors:Wenjie Tang, Yuan Zhou, Erqiang Xu, Keyan Cheng, Minne Li, Liquan Xiao
Title: DSGBench: A Diverse Strategic Game Benchmark for Evaluating LLM-based Agents in Complex Decision-Making Environments
Abstract:
Large Language Model (LLM)-based agents have been increasingly popular in solving complex and dynamic tasks, which requires proper evaluation systems to assess their capabilities. Nevertheless, existing benchmarks usually either focus on single-objective tasks or use overly broad assessing metrics, failing to provide a comprehensive inspection of the actual capabilities of LLM-based agents in complicated decision-making tasks. To address these issues, we introduce DSGBench, a more rigorous evaluation platform for strategic decision-making. Firstly, it incorporates six complex strategic games which serve as ideal testbeds due to their long-term and multi-dimensional decision-making demands and flexibility in customizing tasks of various difficulty levels or multiple targets. Secondly, DSGBench employs a fine-grained evaluation scoring system which examines the decision-making capabilities by looking into the performance in five specific dimensions and offering a comprehensive assessment in a well-designed way. Furthermore, DSGBench also incorporates an automated decision-tracking mechanism which enables in-depth analysis of agent behaviour patterns and the changes in their strategies. We demonstrate the advances of DSGBench by applying it to multiple popular LLM-based agents and our results suggest that DSGBench provides valuable insights in choosing LLM-based agents as well as improving their future development. DSGBench is available at https://github.com/DeciBrain-Group/DSGBench.
中文: DSGBench是一个针对基于大语言模型的智能体推出的严格评估平台,它通过六种复杂策略游戏和细粒度评分系统,全面检验多维度决策能力,并利用自动化追踪机制深入分析行为模式。
English: DSGBench is introduced as a rigorous evaluation platform for LLM-based agents, featuring six complex strategic games and a fine-grained scoring system to comprehensively assess decision-making capabilities across multiple dimensions, with automated tracking for in-depth behavioral analysis.

Authors:Xudong Lu, Haohao Gao, Renshou Wu, Shuai Ren, Xiaoxin Chen, Hongsheng Li, Fangyuan Li
Title: SmartBench: Is Your LLM Truly a Good Chinese Smartphone Assistant?
Abstract:
Large Language Models (LLMs) have become integral to daily life, especially advancing as intelligent assistants through on-device deployment on smartphones. However, existing LLM evaluation benchmarks predominantly focus on objective tasks like mathematics and coding in English, which do not necessarily reflect the practical use cases of on-device LLMs in real-world mobile scenarios, especially for Chinese users. To address these gaps, we introduce SmartBench, the first benchmark designed to evaluate the capabilities of on-device LLMs in Chinese mobile contexts. We analyze functionalities provided by representative smartphone manufacturers and divide them into five categories: text summarization, text Q&A, information extraction, content creation, and notification management, further detailed into 20 specific tasks. For each task, we construct high-quality datasets comprising 50 to 200 question-answer pairs that reflect everyday mobile interactions, and we develop automated evaluation criteria tailored for these tasks. We conduct comprehensive evaluations of on-device LLMs and MLLMs using SmartBench and also assess their performance after quantized deployment on real smartphone NPUs. Our contributions provide a standardized framework for evaluating on-device LLMs in Chinese, promoting further development and optimization in this critical area. Code and data will be available at https://github.com/vivo-ai-lab/SmartBench.
中文: SmartBench是首个针对中文移动场景下设备端大语言模型的评估基准,涵盖五大功能类别并提供自动化评估标准,旨在弥补实际应用场景中的评估空白。
English: SmartBench is the first benchmark designed to evaluate on-device LLMs in Chinese mobile contexts, covering five key functional categories and providing automated evaluation criteria to address the gap in practical usage scenarios.

Authors:Yubin Kim, Hyewon Jeong, Shan Chen, Shuyue Stella Li, Mingyu Lu, Kumail Alhamoud, Jimin Mun, Cristina Grau, Minseok Jung, Rodrigo Gameiro, Lizhou Fan, Eugene Park, Tristan Lin, Joonsik Yoon, Wonjin Yoon, Maarten Sap, Yulia Tsvetkov, Paul Liang, Xuhai Xu, Xin Liu, Daniel McDuff, Hyeonhoon Lee, Hae Won Park, Samir Tulebaev, Cynthia Breazeal
Title: Medical Hallucinations in Foundation Models and Their Impact on Healthcare
Abstract:
Foundation Models that are capable of processing and generating multi-modal data have transformed AI's role in medicine. However, a key limitation of their reliability is hallucination, where inaccurate or fabricated information can impact clinical decisions and patient safety. We define medical hallucination as any instance in which a model generates misleading medical content. This paper examines the unique characteristics, causes, and implications of medical hallucinations, with a particular focus on how these errors manifest themselves in real-world clinical scenarios. Our contributions include (1) a taxonomy for understanding and addressing medical hallucinations, (2) benchmarking models using a medical hallucination dataset and physician-annotated LLM responses to real medical cases, providing direct insight into the clinical impact of hallucinations, and (3) a multi-national clinician survey on their experiences with medical hallucinations. Our results reveal that inference techniques such as Chain-of-Thought (CoT) and Search Augmented Generation can effectively reduce hallucination rates. However, despite these improvements, non-trivial levels of hallucination persist. These findings underscore the ethical and practical imperative for robust detection and mitigation strategies, establishing a foundation for regulatory policies that prioritize patient safety and maintain clinical integrity as AI becomes more integrated into healthcare. The feedback from clinicians highlights the urgent need for not only technical advances but also for clearer ethical and regulatory guidelines to ensure patient safety. A repository organizing the paper resources, summaries, and additional information is available at https://github.com/mitmedialab/medical_hallucination.
中文: 医学基础模型因产生误导性医疗内容的幻觉而存在可靠性问题,本研究通过提出分类法、基准测试和临床医生调查,强调需要检测策略和伦理指南以确保患者安全。
English: Foundation models in medicine face reliability issues due to hallucinations, which generate misleading medical content, and this study proposes a taxonomy, benchmarks models, and surveys clinicians to highlight the need for detection strategies and ethical guidelines to ensure patient safety.

Authors:Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mostafa Rifat Tazwar, Md Jobayer, Md. Mehedi Hasan Shawon, Md Rakibul Hasan
Title: CSTRL: Context-Driven Sequential Transfer Learning for Abstractive Radiology Report Summarization
Abstract:
A radiology report comprises several sections, including the Findings and Impression of the diagnosis. Automatically generating the Impression from the Findings is crucial for reducing radiologists' workload and improving diagnostic accuracy. Pretrained models that excel in common abstractive summarization problems encounter challenges when applied to specialized medical domains largely due to the complex terminology and the necessity for accurate clinical context. Such tasks in medical domains demand extracting core information, avoiding context shifts, and maintaining proper flow. Misuse of medical terms can lead to drastic clinical errors. To address these issues, we introduce a sequential transfer learning approach that ensures key content extraction and coherent summarization. Sequential transfer learning often faces challenges like initial parameter decay and knowledge loss, which we resolve with Fisher matrix regularization. Using MIMIC-CXR and Open-I datasets, our model, CSTRL - Context-driven Sequential TRansfer Learning - achieved state-of-the-art performance, showing 56.2% improvement in BLEU-1, 40.5% in BLEU-2, 84.3% in BLEU-3, 28.9% in ROUGE-1, 41.0% in ROUGE-2 and 26.5% in ROUGE-3 score over benchmark studies. We also analyze factual consistency scores while preserving the medical context. Our code is publicly available at https://github.com/fahmidahossain/Report_Summarization.
中文: 本研究提出CSTRL模型,通过Fisher矩阵正则化的顺序迁移学习方法解决医学报告摘要中的专业术语和临床语境难题,在MIMIC-CXR和Open-I数据集上实现了最先进的性能,各项评估指标显著提升。
English: This study introduces CSTRL, a context-driven sequential transfer learning model that overcomes challenges in medical report summarization by using Fisher matrix regularization to prevent knowledge loss, achieving state-of-the-art performance on MIMIC-CXR and Open-I datasets with significant improvements in BLEU and ROUGE scores.
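The Fisher matrix regularization mentioned above is, in spirit, an elastic-weight-consolidation-style penalty, L = L_task + λ Σ_i F_i (θ_i − θ*_i)², which discourages drift on parameters the previous transfer stage marked as important. A hedged sketch; λ and the stored snapshots are placeholders.

```python
import torch

def fisher_regularized_loss(task_loss, model, prev_params, fisher, lam=0.1):
    """Add a Fisher-weighted quadratic penalty tying each parameter to its
    value from the previous transfer stage. prev_params and fisher map
    parameter names to saved tensors; lam is a hypothetical weight."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - prev_params[name]) ** 2).sum()
    return task_loss + lam * penalty
```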

Authors:Shiping Yang, Jie Wu, Wenbiao Ding, Ning Wu, Shining Liang, Ming Gong, Hengyuan Zhang, Dongmei Zhang
Title: Quantifying the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data
Abstract:
Robustness has become a critical attribute for the deployment of RAG systems in real-world applications. Existing research focuses on robustness to explicit noise (e.g., document semantics) but overlooks spurious features (a.k.a. implicit noise). While previous works have explored spurious features in LLMs, they are limited to specific features (e.g., formats) and narrow scenarios (e.g., ICL). In this work, we statistically confirm the presence of spurious features in the RAG paradigm, a robustness problem caused by the sensitivity of LLMs to semantic-agnostic features. Moreover, we provide a comprehensive taxonomy of spurious features and empirically quantify their impact through controlled experiments. Further analysis reveals that not all spurious features are harmful and they can even be beneficial sometimes. Extensive evaluation results across multiple LLMs suggest that spurious features are a widespread and challenging problem in the field of RAG. To facilitate future research, we release all code and data at: https://github.com/maybenotime/RAG-SpuriousFeatures.
中文: 本研究通过系统分类和实证评估,揭示了RAG系统中普遍存在的虚假特征问题,并发现这些特征具有既可能有害也可能有益的双重性质。
English: This study identifies spurious features as a widespread robustness issue in RAG systems, revealing their dual nature of being both harmful and beneficial through comprehensive taxonomy and empirical evaluation.

Authors:Weigao Sun, Disen Lan, Tong Zhu, Xiaoye Qu, Yu Cheng
Title: Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts
Abstract:
Linear Sequence Modeling (LSM) like linear attention, state space models and linear RNNs, and Mixture-of-Experts (MoE) have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparse activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) a Modeling subsystem, which provides a unified framework supporting all instances of LSM, and 2) a Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers, together with the same Sequence Parallelism, to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: https://github.com/OpenSparseLLMs/Linear-MoE.
中文: 本文提出Linear-MoE系统,通过整合线性序列建模的线性复杂度优势与混合专家的稀疏激活特性,在多种基准测试中实现了高效训练与优异性能。
English: This paper introduces Linear-MoE, a production-level system that integrates Linear Sequence Modeling for linear-complexity sequence processing and Mixture-of-Experts for sparse activation, achieving efficient training and competitive performance across various benchmarks.
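A toy version of one such block, with a GRU standing in for the linear-complexity sequence mixer (linear attention, SSM, or linear RNN) and top-1 routing over a small expert pool; both stand-ins are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LinearMoEBlock(nn.Module):
    """One block: linear-complexity sequence mixing followed by a sparsely
    activated expert FFN, joined by a residual connection."""

    def __init__(self, d_model=512, n_experts=8):
        super().__init__()
        self.mixer = nn.GRU(d_model, d_model, batch_first=True)  # LSM stand-in
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        h, _ = self.mixer(x)
        route = self.router(h).argmax(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):
            mask = route == e
            if mask.any():
                out[mask] = expert(h[mask])  # only chosen tokens activate expert e
        return x + out
```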

Authors:Zhenxuan Zhang, Kinhei Lee, Peiyuan Jing, Weihang Deng, Huichi Zhou, Zihao Jin, Jiahao Huang, Zhifan Gao, Dominic C Marshall, Yingying Fang, Guang Yang
Title: GEMA-Score: Granular Explainable Multi-Agent Scoring Framework for Radiology Report Evaluation
Abstract:
Automatic medical report generation has the potential to support clinical diagnosis, reduce the workload of radiologists, and demonstrate potential for enhancing diagnostic consistency. However, current evaluation metrics often fail to reflect the clinical reliability of generated reports. Early overlap-based methods focus on textual matches between predicted and ground-truth entities but miss fine-grained clinical details (e.g., anatomical location, severity). Some diagnostic metrics are limited by fixed vocabularies or templates, reducing their ability to capture diverse clinical expressions. LLM-based approaches further lack interpretable reasoning steps, making it hard to assess or trust their behavior in safety-critical settings. These limitations hinder the comprehensive assessment of the reliability of generated reports and pose risks in their selection for clinical use. Therefore, we propose a Granular Explainable Multi-Agent Score (GEMA-Score) in this paper, which conducts both objective quantification and subjective evaluation through a large language model-based multi-agent workflow. Our GEMA-Score parses structured reports and employs stable calculations through interactive exchanges of information among agents to assess disease diagnosis, location, severity, and uncertainty. Additionally, an LLM-based scoring agent evaluates completeness, readability, and clinical terminology while providing explanatory feedback. Extensive experiments validate that GEMA-Score achieves the highest correlation with human expert evaluations on a public dataset, demonstrating its effectiveness in clinical scoring (Kendall coefficient = $0.69$ for ReXVal dataset and Kendall coefficient = $0.45$ for RadEvalX dataset). The anonymous project demo is available at: https://github.com/Zhenxuan-Zhang/GEMA_score.
中文:本文提出GEMA-Score这一新型多智能体评估框架,通过客观量化临床可靠性与主观评价报告质量,在医疗报告生成任务中实现了与专家评估的最佳吻合度。
English: This paper introduces GEMA-Score, a novel multi-agent evaluation framework that objectively quantifies clinical reliability and subjectively assesses report quality, achieving superior alignment with expert judgments in medical report generation.

Authors:Nikolai Ilinykh, Shalom Lappin, Asad Sayeed, Sharid Loáiciga
Title: Coreference as an indicator of context scope in multimodal narrative
Abstract:
We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality. Materials, metrics, and code for our study are available at https://github.com/GU-CLASP/coreference-context-scope.
中文摘要:大型多模态模型在视觉叙事任务中指代表达分布上与人类存在显著差异,难以有效追踪混合实体指代,尽管生成质量有所提升。
English Summary: Large multimodal models differ from humans in managing coreference distribution during visual storytelling, as they struggle with tracking mixed entity references despite improved generation quality.

Authors:Neemesh Yadav, Jiarui Liu, Francesco Ortu, Roya Ensafi, Zhijing Jin, Rada Mihalcea
Title: Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
Abstract:
The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work. Our code and data are available at https://github.com/causalNLP/censorship
中文摘要:本研究通过逆向工程分析多国内容审查决策,结合Shapley值和LLM生成解释来探究内容审核的内在机制,揭示了不同国家审查内容的特征模式,并评估了大语言模型在内容审核中的应用效果。
English Summary: This study investigates the hidden mechanisms of content moderation by reverse-engineering censorship decisions across countries and analyzing them through Shapley values and LLM-generated explanations, revealing distinct patterns in moderated content while evaluating LLMs' effectiveness in this domain.

Authors:Ruixi Lin, Ziqiao Wang, Yang You
Title: Ensemble Debiasing Across Class and Sample Levels for Fairer Prompting Accuracy
Abstract:
Language models are strong few-shot learners and achieve good overall accuracy in text classification tasks, masking the fact that their results suffer from severe class accuracy imbalance. We believe that the pursuit of overall accuracy should not come from enriching the strong classes, but from raising up the weak ones. To address the imbalance, we propose a Heaviside step function based ensemble debiasing method, which enables flexible rectifications of in-context learned class probabilities at both class and sample levels. Evaluations with Llama-2-13B on seven text classification benchmarks show that our approach achieves state-of-the-art overall accuracy gains with balanced class accuracies. More importantly, we perform analyses on the resulting probability correction scheme, showing that sample-level corrections are necessary to elevate weak classes. Due to effectively correcting weak classes, our method also brings significant performance gains to a larger model variant, Llama-2-70B, especially on a biomedical domain task, further demonstrating the necessity of ensemble debiasing at both levels. Our source code is available at https://github.com/NUS-HPC-AI-Lab/DCS.
Chinese: 语言模型在文本分类中总体准确率高但存在类别不平衡问题,我们提出的基于Heaviside阶跃函数的集成去偏方法通过在类别和样本层面灵活修正概率,有效提升了弱类别性能,实现了均衡且领先的分类效果。
English: Language models achieve high overall accuracy in text classification but suffer from class imbalance, which our proposed Heaviside step function-based ensemble debiasing method effectively addresses by correcting probabilities at both class and sample levels, leading to state-of-the-art balanced performance.
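To give the flavor of class- plus sample-level rectification, here is a toy Heaviside-gated correction of in-context class probabilities; the gating rule, offsets, and damping constant are invented for this example and are not the paper's actual DCS scheme.

```python
import numpy as np

def heaviside_debias(probs, class_offset, margin_gate=0.5):
    """Class-level shift plus a sample-level correction that fires only
    when the Heaviside gate detects an excessive top-class margin."""
    probs = probs + class_offset                          # class-level shift
    top, runner = np.sort(probs)[-1], np.sort(probs)[-2]
    gate = np.heaviside(top - runner - margin_gate, 0.0)  # 1 if margin too large
    probs[np.argmax(probs)] -= gate * 0.5 * (top - runner)  # damp strong class
    probs = np.clip(probs, 1e-9, None)
    return probs / probs.sum()

p = heaviside_debias(np.array([0.80, 0.15, 0.05]),
                     np.array([-0.05, 0.03, 0.02]))
print(p.round(3))  # strong class damped, weak classes raised
```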

Authors:Bowen Wu, Wenqing Wang, Haoran Li, Ying Li, Jingsong Yu, Baoxun Wang
Title: Interpersonal Memory Matters: A New Task for Proactive Dialogue Utilizing Conversational History
Abstract:
Proactive dialogue systems aim to empower chatbots with the capability of leading conversations towards specific targets, thereby enhancing user engagement and service autonomy. Existing systems typically target pre-defined keywords or entities, neglecting user attributes and preferences implicit in dialogue history, hindering the development of long-term user intimacy. To address these challenges, we take a radical step towards building a more human-like conversational agent by integrating proactive dialogue systems with long-term memory into a unified framework. Specifically, we define a novel task named Memory-aware Proactive Dialogue (MapDia). By decomposing the task, we then propose an automatic data construction method and create the first Chinese Memory-aware Proactive Dataset (ChMapData). Furthermore, we introduce a joint framework based on Retrieval Augmented Generation (RAG), featuring three modules: Topic Summarization, Topic Retrieval, and Proactive Topic-shifting Detection and Generation, designed to steer dialogues towards relevant historical topics at the right time. The effectiveness of our dataset and models is validated through both automatic and human evaluations. We release the open-source framework and dataset at https://github.com/FrontierLabs/MapDia.
Chinese Summary: 本研究提出将主动对话系统与长期记忆相结合的统一框架,定义了MapDia新任务并构建首个中文记忆感知数据集ChMapData,使聊天机器人能在适当时机引导对话转向相关历史话题。
English Summary: This study introduces a unified framework integrating proactive dialogue systems with long-term memory, proposing the novel MapDia task and creating the first Chinese dataset (ChMapData) to enable chatbots to steer conversations toward relevant historical topics at appropriate moments.

Authors:Tianjun Wei, Wei Wen, Ruizhi Qiao, Xing Sun, Jianghong Ma
Title: RocketEval: Efficient Automated LLM Evaluation via Grading Checklist
Abstract:
Evaluating large language models (LLMs) in diverse and challenging scenarios is essential to align them with human preferences. To mitigate the prohibitive costs associated with human evaluations, utilizing a powerful LLM as a judge has emerged as a favored approach. Nevertheless, this methodology encounters several challenges, including substantial expenses, concerns regarding privacy and security, and reproducibility. In this paper, we propose a straightforward, replicable, and accurate automated evaluation method by leveraging a lightweight LLM as the judge, named RocketEval. Initially, we identify that the performance disparity between lightweight and powerful LLMs in evaluation tasks primarily stems from their ability to conduct comprehensive analyses, which is not easily enhanced through techniques such as chain-of-thought reasoning. By reframing the evaluation task as a multi-faceted Q&A using an instance-specific checklist, we demonstrate that the limited judgment accuracy of lightweight LLMs largely stems from high uncertainty and positional bias. To address these challenges, we introduce an automated evaluation process grounded in checklist grading, which is designed to accommodate a variety of scenarios and questions. This process encompasses the creation of checklists, the grading of these checklists by lightweight LLMs, and the reweighting of checklist items to align with the supervised annotations. Our experiments on the automated evaluation benchmarks MT-Bench and WildBench reveal that RocketEval, when using Gemma-2-2B as the judge, achieves a high correlation (0.965) with human preferences, which is comparable to GPT-4o. Moreover, RocketEval provides a cost reduction exceeding 50-fold for large-scale evaluation and comparison scenarios. Our code is available at https://github.com/Joinn99/RocketEval-ICLR.
中文摘要:RocketEval提出了一种采用轻量级大语言模型作为评估者的自动化评估方法,通过基于清单的评分机制解决不确定性和位置偏差问题,在实现与人类评估高度相关的同时将大规模评估成本降低了50倍以上。
English Summary: RocketEval introduces a cost-effective automated evaluation method using lightweight LLMs as judges, achieving human-level correlation while reducing costs by over 50 times through checklist-based grading that addresses uncertainty and bias.
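A compact sketch of the checklist-grading loop; `judge` is a hypothetical callable wrapping the lightweight LLM and returning a yes-probability in [0, 1], and the supervised reweighting step is reduced to fixed per-item weights.

```python
def rocket_style_score(judge, question, answer, checklist, weights):
    """Grade each checklist item as a yes/no query to a lightweight judge,
    then aggregate the item grades with learned (here: fixed) weights."""
    grades = [judge(f"Question: {question}\nAnswer: {answer}\n"
                    f"Does the answer satisfy: {item}? Reply yes or no.")
              for item in checklist]
    return sum(w * g for w, g in zip(weights, grades)) / sum(weights)
```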

Authors:Xuheng Cai, Erica Zhang
Title: HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model
Abstract:
Egyptian hieroglyphs are found on numerous ancient Egyptian artifacts, but it is common that they are blurry or even missing due to erosion. Existing efforts to restore blurry hieroglyphs adopt computer vision techniques such as CNNs and model hieroglyph recovery as an image classification task, which suffers from two major limitations: (i) They cannot handle severely damaged or completely missing hieroglyphs. (ii) They make predictions based on a single hieroglyph without considering contextual and grammatical information. This paper proposes a novel approach to model hieroglyph recovery as a next word prediction task and use language models to address it. We compare the performance of different SOTA language models and choose LSTM as the architecture of our HieroLM due to the strong local affinity of semantics in Egyptian hieroglyph texts. Experiments show that HieroLM achieves over 44% accuracy and maintains notable performance on multi-shot predictions and scarce data, which makes it a pragmatic tool to assist scholars in inferring missing hieroglyphs. It can also complement CV-based models to significantly reduce perplexity in recognizing blurry hieroglyphs. Our code is available at https://github.com/Rick-Cai/HieroLM/.
中文:本文提出HieroLM新方法,将象形文字修复建模为语言模型的下一个词预测任务,通过利用上下文信息实现了超过44%的准确率,能有效处理严重损坏或缺失的象形文字,其性能优于传统计算机视觉方法。
English: This paper introduces HieroLM, a novel approach that models hieroglyph recovery as a next word prediction task using language models, achieving over 44% accuracy and effectively handling severely damaged or missing hieroglyphs by leveraging contextual information, outperforming traditional computer vision methods.
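Modeling recovery as next-word prediction means the core model is just a token-level language model over sign sequences; a minimal LSTM sketch (vocabulary size and dimensions are placeholders):

```python
import torch
import torch.nn as nn

class HieroLMSketch(nn.Module):
    """Next-sign LSTM language model: embed a hieroglyph sequence and
    predict the following sign at every position."""

    def __init__(self, vocab=1000, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens):  # tokens: (batch, seq)
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)      # (batch, seq, vocab) next-sign logits

model = HieroLMSketch()
logits = model(torch.randint(0, 1000, (2, 12)))
missing_sign = logits[:, -1].argmax(dim=-1)  # best guess for the eroded sign
```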

Authors:Souvik Kundu, Anahita Bhiwandiwalla, Sungduk Yu, Phillip Howard, Tiep Le, Sharath Nittur Sridhar, David Cobbley, Hao Kang, Vasudev Lal
Title: LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
Abstract:
Despite recent efforts in understanding the compression impact on large language models (LLMs) in terms of their downstream task performance and trustworthiness on relatively simpler uni-modal benchmarks (for example, question answering, common sense reasoning), their detailed study on multi-modal Large Vision-Language Models (LVLMs) is yet to be unveiled. Towards mitigating this gap, we present LVLM-Compress-Bench, a framework to first thoroughly study the broad impact of compression on the generative performance of LVLMs with multi-modal input driven tasks. Specifically, we consider two major classes of compression for autoregressive models, namely KV cache and weight compression, for the dynamically growing intermediate cache and static weights, respectively. We use four LVLM variants of the popular LLaVA framework to present our analysis by integrating various state-of-the-art KV and weight compression methods including uniform, outlier-reduced, and group quantization for the KV cache and weights. With this framework we demonstrate on ten different multi-modal datasets with different capabilities including recognition, knowledge, language generation, spatial awareness, visual reasoning, hallucination and visual illusion identification, toxicity, stereotypes and bias. Specifically, our framework demonstrates the compression impact on both general and ethically critical metrics leveraging a combination of real world and synthetic datasets to encompass diverse societal intersectional attributes. Extensive experimental evaluations yield diverse and intriguing observations on the behavior of LVLMs at different quantization budgets for the KV cache and weights, in both maintaining and losing performance as compared to the baseline model with FP16 data format. Code will be open-sourced at https://github.com/opengear-project/LVLM-compress-bench.
中文: 本研究提出LVLM-Compress-Bench框架,全面评估压缩技术对多模态大视觉语言模型在各项任务中性能的影响,揭示了其在通用指标和伦理指标上的多样化表现。
English: This study introduces LVLM-Compress-Bench, a framework to comprehensively evaluate how compression techniques affect the performance of multimodal large vision-language models across various tasks, revealing diverse impacts on both general and ethical metrics.
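As an example of the kind of compression the benchmark sweeps over, here is a uniform per-tensor fake-quantization routine; group-wise and outlier-reduced variants refine the same scale-and-round pattern.

```python
import torch

def quantize_uniform(t, bits=4):
    """Round a tensor onto a uniform grid of 2**bits levels spanning its
    range, then dequantize; the roundtrip mimics low-bit storage of KV
    caches or weights."""
    lo, hi = t.min(), t.max()
    scale = ((hi - lo) / (2 ** bits - 1)).clamp(min=1e-8)
    q = ((t - lo) / scale).round()
    return q * scale + lo

kv = torch.randn(8, 64)  # toy KV-cache slice
err = (quantize_uniform(kv, bits=4) - kv).abs().mean()  # quantization error
```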

Authors:Mahfuz Ahmed Anik, Abdur Rahman, Azmine Toushik Wasi, Md Manjurul Ahsan
Title: Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems
Abstract:
Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI-driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance, leading to translations that marginalize linguistic diversity. To address these challenges, we propose a multi-agent AI framework designed for culturally adaptive translation in underserved language communities. Our approach leverages specialized agents for translation, interpretation, content synthesis, and bias evaluation, ensuring that linguistic accuracy and cultural relevance are preserved. Using CrewAI and LangChain, our system enhances contextual fidelity while mitigating biases through external validation. Comparative analysis shows that our framework outperforms GPT-4o, producing contextually rich and culturally embedded translations, a critical advancement for Indigenous, regional, and low-resource languages. This research underscores the potential of multi-agent AI in fostering equitable, sustainable, and culturally sensitive NLP technologies, aligning with the AI Governance, Cultural NLP, and Sustainable NLP pillars of Language Models for Underserved Communities. Our full experimental codebase is publicly available at: https://github.com/ciol-researchlab/Context-Aware_Translation_MAS
中文摘要:本研究提出了一种多智能体AI框架,通过专业代理和偏见缓解技术,为弱势语言群体提供文化适应性翻译,在保持语言准确性和文化相关性方面优于GPT-4o。
English Summary: This research introduces a multi-agent AI framework that enhances culturally adaptive translation for underserved languages, outperforming GPT-4o by preserving linguistic accuracy and cultural relevance through specialized agents and bias mitigation.

Authors:Stephen Chung, Wenyu Du, Jie Fu
Title: Learning from Failures in Multi-Attempt Reinforcement Learning
Abstract:
Recent advancements in reinforcement learning (RL) for large language models (LLMs), exemplified by DeepSeek R1, have shown that even a simple question-answering task can substantially improve an LLM's reasoning capabilities. In this work, we extend this approach by modifying the task into a multi-attempt setting. Instead of generating a single response per question, the model is given multiple attempts, with feedback provided after incorrect responses. The multi-attempt task encourages the model to refine its previous attempts and improve search efficiency. Experimental results show that even a small LLM trained on a multi-attempt task achieves significantly higher accuracy when evaluated with more attempts, improving from 45.6% with 1 attempt to 52.5% with 2 attempts on the math benchmark. In contrast, the same LLM trained on a standard single-turn task exhibits only a marginal improvement, increasing from 42.3% to 43.2% when given more attempts during evaluation. The results indicate that, compared to the standard single-turn task, an LLM trained on a multi-attempt task achieves slightly better performance on math benchmarks while also learning to refine its responses more effectively based on user feedback. Full code is available at https://github.com/DualityRL/multi-attempt
中文:最新研究表明,通过在多轮尝试任务中结合反馈来训练大语言模型,可显著提升其推理准确性和答案优化能力,效果优于单轮训练方法。
English: Recent research demonstrates that training large language models on multi-attempt tasks with feedback significantly enhances their reasoning accuracy and ability to refine responses, outperforming single-turn training methods.
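The multi-attempt task reduces to a rollout loop in which feedback on wrong answers is appended to the prompt before the next try; in this sketch `model` and `check_answer` are hypothetical callables and the reward is binary.

```python
def multi_attempt_rollout(model, question, check_answer, max_attempts=3):
    """Query the model up to max_attempts times, feeding back a failure
    notice after each wrong answer; reward 1.0 only on a correct attempt."""
    history = [f"Question: {question}"]
    for attempt in range(1, max_attempts + 1):
        answer = model("\n".join(history))
        if check_answer(answer):
            return answer, 1.0
        history.append(f"Attempt {attempt}: {answer}\n"
                       "Feedback: incorrect, please try again.")
    return answer, 0.0
```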

Authors:Jie Ouyang, Tingyue Pan, Mingyue Cheng, Ruiran Yan, Yucong Luo, Jiaying Lin, Qi Liu
Title: HoH: A Dynamic Benchmark for Evaluating the Impact of Outdated Information on Retrieval-Augmented Generation
Abstract:
While Retrieval-Augmented Generation (RAG) has emerged as an effective approach for addressing the knowledge outdating problem in Large Language Models (LLMs), it still faces a critical challenge: the prevalence of outdated information in knowledge bases. Current research primarily focuses on incorporating up-to-date information, yet the impact of outdated information coexisting in retrieval sources remains inadequately addressed. To bridge this gap, we introduce HoH, the first benchmark specifically designed to evaluate the impact of outdated information on RAG. Our benchmark leverages token-level diff algorithms combined with LLM pipelines to efficiently create a large-scale QA dataset that accurately captures the evolution of temporal knowledge in real-world facts. Through comprehensive experiments, we reveal that outdated information significantly degrades RAG performance in two critical ways: (1) it substantially reduces response accuracy by distracting models from correct information, and (2) it can mislead models into generating potentially harmful outputs, even when current information is available. Current RAG approaches struggle with both retrieval and generation aspects when handling outdated information. These findings highlight the urgent need for innovative solutions to address the temporal challenges in RAG. Our code and data are available at: https://github.com/0russwest0/HoH.
中文:HoH基准测试揭示,检索源中的过时信息会显著降低RAG系统性能,不仅削弱回答准确性还可能产生有害输出,这凸显了解决RAG时序挑战的迫切需求。
English: The HoH benchmark reveals that outdated information in retrieval sources significantly reduces RAG performance by lowering response accuracy and potentially causing harmful outputs, highlighting the need for solutions to temporal challenges in RAG systems.
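The token-level diff step underlying the benchmark's construction can be approximated with standard sequence alignment; the sketch below uses Python's difflib to recover the changed spans of a fact update, which would then seed QA pairs contrasting outdated and current answers.

```python
import difflib

def token_diff(old_fact, new_fact):
    """Align two statements of a fact token-by-token and return the spans
    that changed between the outdated and the current version."""
    old_toks, new_toks = old_fact.split(), new_fact.split()
    sm = difflib.SequenceMatcher(a=old_toks, b=new_toks)
    return [(" ".join(old_toks[i1:i2]), " ".join(new_toks[j1:j2]))
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

print(token_diff("The CEO of Acme is Alice", "The CEO of Acme is Bob"))
# [('Alice', 'Bob')]
```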

Authors:Sumin Ha, Jun Hyeong Kim, Yinhua Piao, Sun Kim
Title: MV-CLAM: Multi-View Molecular Interpretation with Cross-Modal Projection via Language Model
Abstract:
Human expertise in chemistry and biomedicine relies on contextual molecular understanding, a capability that large language models (LLMs) can extend through fine-grained alignment between molecular structures and text. Recent multimodal learning advances focus on cross-modal alignment, but existing molecule-text models ignore complementary information in different molecular views and rely on single-view representations, limiting molecular understanding. Moreover, naïve multi-view alignment strategies face two challenges: (1) separately aligned spaces with inconsistent mappings between molecule and text embeddings, and (2) existing loss objectives that fail to preserve complementary information for fine-grained alignment. This can limit the LLM's ability to fully understand the molecular properties. To address these issues, we propose MV-CLAM, a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-query transformer (MQ-Former). Our approach ensures cross-view consistency while a token-level contrastive loss preserves diverse molecular features across textual queries. MV-CLAM enhances molecular reasoning, improving retrieval and captioning accuracy. The source code of MV-CLAM is available at https://github.com/sumin124/mv-clam.git.
中文:MV-CLAM是一个新颖的框架,它通过多查询变换器将多视角分子表征对齐到统一的文本空间,从而通过提升检索和描述准确性来增强分子推理能力。
English: MV-CLAM is a novel framework that aligns multi-view molecular representations into a unified textual space using a multi-query transformer, enhancing molecular reasoning by improving retrieval and captioning accuracy.

Authors:Jules Viennot, Guillaume Baudart, Emilio Jesús Gallego Arias, Marc Lelarge
Title: MiniF2F in Rocq: Automatic Translation Between Proof Assistants -- A Case Study
Abstract:
In this work, we conduct an experiment using state-of-the-art LLMs to translate MiniF2F into Rocq. The translation task focuses on generating a Rocq theorem based on three sources: a natural language description, the Lean formalization, and the Isabelle formalization. We conducted our experiment in 3 stages of increasing complexity, from basic one-shot prompting to multi-turn conversations that incorporate feedback from unsuccessful attempts. At each stage, we perform multiple rounds of translation using increasingly advanced models: GPT-4o mini, Claude 3.5 Sonnet, o1 mini, and o1. We successfully translated 478 out of 488 theorems. The dataset is open-source: https://github.com/LLM4Rocq/miniF2F-rocq.
中文: 本研究通过逐步复杂的提示阶段,利用先进的大语言模型成功将488个定理中的478个从MiniF2F翻译为Rocq,并公开了数据集。
English: This study successfully translated 478 out of 488 theorems from MiniF2F to Rocq using advanced LLMs through progressively complex prompting stages, with the dataset made publicly available.

Authors:Hritik Bansal, Pratyush Maini
Title: Peeking Behind Closed Doors: Risks of LLM Evaluation by Private Data Curators
Abstract:
The rapid advancement in building large language models (LLMs) has intensified competition among big-tech companies and AI startups. In this regard, model evaluations are critical for product and investment-related decision-making. While open evaluation sets like MMLU initially drove progress, concerns around data contamination and data bias have constantly questioned their reliability. As a result, it has led to the rise of private data curators who have begun conducting hidden evaluations with high-quality self-curated test prompts and their own expert annotators. In this paper, we argue that despite potential advantages in addressing contamination issues, private evaluations introduce inadvertent financial and evaluation risks. In particular, the key concerns include the potential conflict of interest arising from private data curators' business relationships with their clients (leading LLM firms). In addition, we highlight that the subjective preferences of private expert annotators will lead to inherent evaluation bias towards the models trained with the private curators' data. Overall, this paper lays the foundation for studying the risks of private evaluations that can lead to wide-ranging community discussions and policy changes.

Authors:Zheng Hui, Yinheng Li, Dan zhao, Tianyi Chen, Colby Banbury, Kazuhito Koishida
Title: WinClick: GUI Grounding with Multimodal Large Language Models
Abstract:
Graphical User Interface (GUI) tasks are vital for automating workflows such as software testing and user interface navigation. For users, the GUI is the most intuitive platform for interacting with a computer. Previous work identified a key challenge in developing visual GUI agents: GUI grounding - the ability to accurately locate screen elements based on instructions. However, most existing GUI agents rely on structured data formats like DOM or HTML files in training or inferencing, which are not available across all applications, particularly in general desktop environments such as Windows OS. To address this, we introduce WinClick, a novel visual GUI agent developed for the Windows platform. WinClick leverages screenshots to detect actionable regions. To overcome the challenge of GUI grounding, we enhance WinClick with GUI grounding pre-training and propose an LLM-based method for aligning GUI grounding data. Additionally, we introduce WinSpot, the first comprehensive benchmark for GUI grounding on Windows. Our experiments demonstrate that WinClick, combined with GUI grounding pre-training, significantly outperforms existing baselines, offering a scalable solution for GUI automation in desktop environments. WinSpot is publicly available at https://github.com/zackhuiiiii/WinSpot.
中文:WinClick是一种创新的Windows视觉GUI代理,通过截图和增强的GUI基础预训练来自动化桌面任务,其性能优于现有方法,并得到WinSpot基准测试的支持。
English: WinClick is a novel visual GUI agent for Windows that uses screenshots and enhanced GUI grounding pre-training to automate desktop tasks, outperforming existing methods and supported by the WinSpot benchmark.

Authors:Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
Title: LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Abstract:
Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.
中文: LLMVoX是一种轻量级、与LLM无关的流式TTS系统,能以低延迟生成高质量语音并保持基础LLM能力,支持无缝无限对话和多语言任务,无需额外多模态训练。
English: LLMVoX is a lightweight, LLM-agnostic streaming TTS system that generates high-quality speech with low latency while preserving the base LLM's capabilities, supporting seamless infinite dialogues and multilingual tasks without additional multimodal training.

Authors:Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi
Title: Scaling Rich Style-Prompted Text-to-Speech Datasets
Abstract:
We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps.
中文: 我们推出ParaSpeechCaps大规模语音风格标注数据集,结合人工标注与自动扩展数据,显著提升了语音合成模型的风格一致性和自然度表现。
English: We introduce ParaSpeechCaps, a large-scale dataset with rich style annotations for speech, combining human-labeled and automatically scaled data to significantly improve text-to-speech model performance in style consistency and naturalness.

Authors:Dou Hu, Lingwei Wei, Wei Zhou, Songlin Hu
Title: An Information-theoretic Multi-task Representation Learning Framework for Natural Language Understanding
Abstract:
This paper proposes a new principled multi-task representation learning framework (InfoMTL) to extract noise-invariant sufficient representations for all tasks. It ensures sufficiency of shared representations for all tasks and mitigates the negative effect of redundant features, which can enhance language understanding of pre-trained language models (PLMs) under the multi-task paradigm. Firstly, a shared information maximization principle is proposed to learn more sufficient shared representations for all target tasks. It can avoid the insufficiency issue arising from representation compression in the multi-task paradigm. Secondly, a task-specific information minimization principle is designed to mitigate the negative effect of potential redundant features in the input for each task. It can compress task-irrelevant redundant information and preserve necessary information relevant to the target for multi-task prediction. Experiments on six classification benchmarks show that our method outperforms 12 comparative multi-task methods under the same multi-task settings, especially in data-constrained and noisy scenarios. Extensive experiments demonstrate that the learned representations are more sufficient, data-efficient, and robust.
中文摘要:本文提出InfoMTL多任务学习框架,通过共享信息最大化和任务特定信息最小化原则,提取噪声不变的充分表征,增强预训练语言模型在多任务场景下的理解能力,在多个基准测试中优于现有方法。
English Summary: The paper introduces InfoMTL, a multi-task learning framework that enhances language models by extracting noise-invariant sufficient representations through shared information maximization and task-specific information minimization, outperforming existing methods in various scenarios.

Authors:Wen Yang, Junhong Wu, Chen Wang, Chengqing Zong, Jiajun Zhang
Title: Implicit Cross-Lingual Rewarding for Efficient Multilingual Preference Alignment
Abstract:
Direct Preference Optimization (DPO) has become a prominent method for aligning Large Language Models (LLMs) with human preferences. While DPO has enabled significant progress in aligning English LLMs, multilingual preference alignment is hampered by data scarcity. To address this, we propose a novel approach that captures learned preferences from well-aligned English models by implicit rewards and transfers them to other languages through iterative training. Specifically, we derive an implicit reward model from the logits of an English DPO-aligned model and its corresponding reference model. This reward model is then leveraged to annotate preference relations in cross-lingual instruction-following pairs, using English instructions to evaluate multilingual responses. The annotated data is subsequently used for multilingual DPO fine-tuning, facilitating preference knowledge transfer from English to other languages. Fine-tuning Llama3 for two iterations resulted in a 12.72% average improvement in Win Rate and a 5.97% increase in Length Control Win Rate across all training languages on the X-AlpacaEval leaderboard. Our findings demonstrate that leveraging existing English-aligned models can enable efficient and effective multilingual preference alignment, significantly reducing the need for extensive multilingual preference data. The code is available at https://github.com/ZNLP/Implicit-Cross-Lingual-Rewarding
中文摘要:直接偏好优化(DPO)通过从英语对齐模型中提取隐式奖励,将习得的偏好知识迁移至其他语言,实现了无需大量多语言数据的高效跨语言偏好对齐,显著提升了模型性能。
English Summary: Direct Preference Optimization (DPO) enables effective multilingual preference alignment by transferring learned preferences from English-aligned models to other languages through implicit rewards, achieving significant performance improvements without extensive multilingual data.
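The implicit reward follows the standard DPO identity, r(x, y) = β(log π_dpo(y|x) − log π_ref(y|x)), up to a prompt-only term that cancels when ranking responses to the same instruction; a minimal sketch with placeholder log-probabilities:

```python
import torch

def implicit_reward(dpo_logprob, ref_logprob, beta=0.1):
    """Implicit reward of a DPO-aligned model relative to its reference,
    computed from summed token log-probs; beta mirrors the DPO temperature
    (the value here is a placeholder)."""
    return beta * (dpo_logprob - ref_logprob)

# Rank two multilingual responses to one English instruction: the higher
# implicit reward becomes the "chosen" side for multilingual DPO.
r_a = implicit_reward(torch.tensor(-41.2), torch.tensor(-44.0))
r_b = implicit_reward(torch.tensor(-39.8), torch.tensor(-40.1))
chosen = "response_a" if r_a > r_b else "response_b"
```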

Authors:Xiangchao Yan, Shiyang Feng, Jiakang Yuan, Renqiu Xia, Bin Wang, Bo Zhang, Lei Bai
Title: SurveyForge: On the Outline Heuristics, Memory-Driven Generation, and Multi-dimensional Evaluation for Automated Survey Writing
Abstract:
Survey papers play a crucial role in scientific research, especially given the rapid growth of research publications. Recently, researchers have begun using LLMs to automate survey generation for better efficiency. However, the quality gap between LLM-generated surveys and those written by humans remains significant, particularly in terms of outline quality and citation accuracy. To close these gaps, we introduce SurveyForge, which first generates the outline by analyzing the logical structure of human-written outlines and referring to the retrieved domain-related articles. Subsequently, leveraging high-quality papers retrieved from memory by our scholar navigation agent, SurveyForge can automatically generate and refine the content of the generated article. Moreover, to achieve a comprehensive evaluation, we construct SurveyBench, which includes 100 human-written survey papers for win-rate comparison and assesses AI-generated survey papers across three dimensions: reference, outline, and content quality. Experiments demonstrate that SurveyForge can outperform previous works such as AutoSurvey.
中文摘要:SurveyForge通过分析人工撰写的大纲结构并参考检索的领域文献来生成大纲,利用学者导航代理检索高质量论文自动生成并优化内容,实验证明其性能优于AutoSurvey等先前工作。
English Summary: SurveyForge is introduced to bridge the quality gap in LLM-generated surveys by analyzing human-written outlines and leveraging retrieved scholarly articles, with experiments showing it outperforms prior methods like AutoSurvey.

Authors:Aoxiong Yin, Kai Shen, Yichong Leng, Xu Tan, Xinyu Zhou, Juncheng Li, Siliang Tang
Title: The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation
Abstract:
Recent advancements in text-to-video (T2V) generation have been driven by two competing paradigms: autoregressive language models and diffusion models. However, each paradigm has intrinsic limitations: language models struggle with visual quality and error accumulation, while diffusion models lack semantic understanding and causal modeling. In this work, we propose LanDiff, a hybrid framework that synergizes the strengths of both paradigms through coarse-to-fine generation. Our architecture introduces three key innovations: (1) a semantic tokenizer that compresses 3D visual features into compact 1D discrete representations through efficient semantic compression, achieving a ~14,000× compression ratio; (2) a language model that generates semantic tokens with high-level semantic relationships; (3) a streaming diffusion model that refines coarse semantics into high-fidelity videos. Experiments show that LanDiff, a 5B model, achieves a score of 85.43 on the VBench T2V benchmark, surpassing the state-of-the-art open-source models Hunyuan Video (13B) and other commercial models such as Sora, Kling, and Hailuo. Furthermore, our model also achieves state-of-the-art performance in long video generation, surpassing other open-source models in this field. Our demo can be viewed at https://landiff.github.io/.
中文:LanDiff是一种结合自回归语言模型与扩散模型优势的混合文本到视频框架,通过从粗到细的生成方式克服了两者固有缺陷,在标准及长视频生成基准测试中均实现了最先进的性能表现。
English: LanDiff is a hybrid text-to-video framework that combines autoregressive language models and diffusion models to overcome their individual limitations, achieving state-of-the-art performance in both standard and long video generation benchmarks.

Authors:Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma
Title: HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization
Abstract:
Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the position of layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence demonstrating that HybridNorm improves gradient flow and model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.
中文: 本文提出HybridNorm混合归一化策略,通过结合Pre-Norm和Post-Norm的优势来改善深度Transformer的梯度流动和鲁棒性,在多个基准测试中持续优于现有方法。
English: The paper introduces HybridNorm, a hybrid normalization strategy that combines Pre-Norm and Post-Norm to enhance gradient flow and robustness in deep transformers, consistently outperforming existing methods across benchmarks.
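A sketch of the block structure just described: normalized Q, K, V inside attention with a strong identity path, and a Post-Norm residual around the FFN; the use of LayerNorm (rather than, say, RMSNorm) and the dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class HybridNormBlock(nn.Module):
    """Transformer block with QKV normalization in attention and
    Post-Norm in the feed-forward sublayer."""

    def __init__(self, d=512, heads=8):
        super().__init__()
        self.q_norm, self.k_norm, self.v_norm = (nn.LayerNorm(d) for _ in range(3))
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.ffn_norm = nn.LayerNorm(d)

    def forward(self, x):  # x: (batch, seq, d)
        q, k, v = self.q_norm(x), self.k_norm(x), self.v_norm(x)
        x = x + self.attn(q, k, v, need_weights=False)[0]  # un-normalized residual
        return self.ffn_norm(x + self.ffn(x))              # Post-Norm FFN

y = HybridNormBlock()(torch.randn(2, 16, 512))
```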

Authors:Armel Zebaze, Benoît Sagot, Rachel Bawden
Title: Compositional Translation: A Novel LLM-based Approach for Low-resource Machine Translation
Abstract:
The ability of generative large language models (LLMs) to perform in-context learning has given rise to a large body of research into how best to prompt models for various natural language processing tasks. Machine Translation (MT) has been shown to benefit from in-context examples, in particular when they are semantically similar to the sentence to translate. In this paper, we propose a new LLM-based translation paradigm, compositional translation, to replace naive few-shot MT with similarity-based demonstrations. An LLM is used to decompose a sentence into simpler phrases, and then to translate each phrase with the help of retrieved demonstrations. Finally, the LLM is prompted to translate the initial sentence with the help of the self-generated phrase-translation pairs. Our intuition is that this approach should improve translation because these shorter phrases should be intrinsically easier to translate and easier to match with relevant examples. This is especially beneficial in low-resource scenarios, and more generally whenever the selection pool is small or out of domain. We show that compositional translation boosts LLM translation performance on a wide range of popular MT benchmarks, including FLORES 200, NTREX 128 and TICO-19. Code and outputs are available at https://github.com/ArmelRandy/compositional-translation
中文:本文提出组合式翻译这一新型基于大语言模型的翻译范式,通过将句子分解为更简单的短语并借助检索示例进行翻译,显著提升了在低资源场景下多个机器翻译基准测试的性能表现。
English: This paper introduces compositional translation, a novel LLM-based approach that decomposes sentences into simpler phrases for translation using retrieved demonstrations, significantly enhancing performance across multiple machine translation benchmarks especially in low-resource settings.
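
A minimal sketch of the three-step compositional translation loop described above. The `llm` (prompt to string) and `retrieve_demos` (phrase to list of source/target example pairs) callables are hypothetical stand-ins for the paper's components, and the prompt wording is illustrative.

```python
def compositional_translate(sentence, llm, retrieve_demos, src="en", tgt="sw"):
    """Sketch: decompose, translate phrases with demos, then recompose."""
    # 1. Decompose the sentence into simpler phrases.
    phrases = llm(
        f"Split into short self-contained phrases, one per line:\n{sentence}"
    ).splitlines()
    # 2. Translate each phrase with retrieved similar demonstrations.
    pairs = []
    for phrase in phrases:
        demos = "\n".join(f"{s} => {t}" for s, t in retrieve_demos(phrase))
        pairs.append((phrase, llm(f"{demos}\nTranslate {src}->{tgt}: {phrase} =>")))
    # 3. Translate the full sentence, conditioning on self-generated pairs.
    hints = "\n".join(f"{p} => {t}" for p, t in pairs)
    return llm(
        f"Using these phrase translations:\n{hints}\n"
        f"Translate {src}->{tgt}: {sentence} =>"
    )
```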

Authors:Zhipeng Chen, Yingqian Min, Beichen Zhang, Jie Chen, Jinhao Jiang, Daixuan Cheng, Wayne Xin Zhao, Zheng Liu, Xu Miao, Yang Lu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
Title: An Empirical Study on Eliciting and Improving R1-like Reasoning Models
Abstract:
In this report, we present the third technical report on the development of slow-thinking models as part of the STILL project. As the technical pathway becomes clearer, scaling RL training has become a central technique for implementing such reasoning models. We systematically experiment with and document the effects of various factors influencing RL training, conducting experiments on both base models and fine-tuned models. Specifically, we demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models, enhancing both response length and test accuracy. Furthermore, we show that even when a model like DeepSeek-R1-Distill-Qwen-1.5B has already achieved a high performance level, it can be further refined through RL training, reaching an accuracy of 39.33% on AIME 2024. Beyond RL training, we also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models. This approach achieves a remarkable accuracy of 86.67% with greedy search on AIME 2024, underscoring its effectiveness in enhancing model capabilities. We release our resources at the STILL project website: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.
中文: 本报告介绍了STILL项目中慢思考模型的进展,通过强化学习训练有效提升了Qwen2.5-32B模型的响应长度与测试准确率,并将DeepSeek-R1-Distill-Qwen-1.5B模型在AIME 2024上的准确率优化至39.33%,同时工具操作技术更使推理准确率达到86.67%,相关资源已发布于GitHub平台。
English: This report details the STILL project's progress in developing slow-thinking models through scaled RL training, which consistently enhances model performance, including boosting the Qwen2.5-32B's response length and accuracy and refining the DeepSeek-R1-Distill-Qwen-1.5B to 39.33% on AIME 2024, while tool manipulation further achieves 86.67% accuracy, with resources available on GitHub.

Authors:Wenke Huang, Jian Liang, Xianda Guo, Yiyang Fang, Guancheng Wan, Xuankun Rong, Chi Wen, Zekun Shi, Qingyun Li, Didi Zhu, Yanbiao Ma, Ke Liang, Bin Yang, He Li, Jiawei Shao, Mang Ye, Bo Du
Title: Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model
Abstract:
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. While MLLMs demonstrate remarkable versatility, they exhibit limited performance on specialized applications, and tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert Specialization, where distribution shifts between pre-training and target datasets constrain target performance, and Open-World Stabilization, where catastrophic forgetting erases the model's general knowledge. In this work, we systematically review recent advancements in MLLM tuning methodologies, classifying them into three paradigms: (I) Selective Tuning, (II) Additive Tuning, and (III) Reparameterization Tuning. Furthermore, we benchmark these tuning strategies across popular MLLM architectures and diverse downstream tasks to establish standardized evaluation analysis and systematic tuning principles. Finally, we highlight several open challenges in this domain and propose future research directions. To facilitate ongoing progress in this rapidly evolving field, we provide a public repository that continuously tracks developments: https://github.com/WenkeHuang/Awesome-MLLM-Tuning.
中文: 本文系统综述了多模态大语言模型的微调方法,通过对比基准测试解决任务专精与知识保持等挑战,并提出了未来研究方向。
English: This paper systematically reviews multi-modal large language model tuning methodologies, addressing challenges like task specialization and knowledge retention through comparative benchmarking and proposing future research directions.

Authors:Dimitri von Rütte, Janis Fluri, Yuhui Ding, Antonio Orvieto, Bernhard Schölkopf, Thomas Hofmann
Title: Generalized Interpolating Discrete Diffusion
Abstract:
While state-of-the-art language models achieve impressive results through next-token prediction, they have inherent limitations such as the inability to revise already generated tokens. This has prompted exploration of alternative approaches such as discrete diffusion. However, masked diffusion, which has emerged as a popular choice due to its simplicity and effectiveness, reintroduces this inability to revise words. To overcome this, we generalize masked diffusion, deriving a new family of general interpolating discrete diffusion (GIDD) which offers greater flexibility in the design of the noising processes. Leveraging a novel diffusion ELBO, we achieve compute-matched state-of-the-art performance in diffusion language modeling. Exploiting GIDD's flexibility, we explore a hybrid approach combining masking and uniform noise, leading to improved sample quality and unlocking the ability for the model to correct its own mistakes, an area where autoregressive models notoriously have struggled. Code: https://github.com/dvruette/gidd/
中文: 本文提出了一种新的通用插值离散扩散(GIDD)模型系列,通过灵活的噪声处理设计克服了现有方法的修订限制,实现了最先进的性能并具备了自我纠错能力。
English: This paper introduces a new family of general interpolating discrete diffusion (GIDD) models that overcome the revision limitations of existing approaches by offering flexible noising processes, achieving state-of-the-art performance and enabling self-correction capabilities.
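
To illustrate the hybrid noising idea, here is a sketch of a forward corruption step that mixes masking with uniform noise. The interpolation schedule and the `p_uniform` mixing rate are illustrative assumptions, not GIDD's exact noising process.

```python
import torch

def hybrid_noise(tokens, t, vocab_size, mask_id, p_uniform=0.1):
    """Sketch of GIDD-style hybrid discrete noising: with probability t a token
    is corrupted; a corrupted token becomes MASK or a uniformly random token."""
    corrupt = torch.rand_like(tokens, dtype=torch.float) < t
    use_uniform = torch.rand_like(tokens, dtype=torch.float) < p_uniform
    random_tokens = torch.randint_like(tokens, vocab_size)
    noised = torch.where(use_uniform, random_tokens,
                         torch.full_like(tokens, mask_id))
    return torch.where(corrupt, noised, tokens)
```

Because some corrupted positions carry random (rather than masked) tokens, the denoiser must learn to detect and overwrite wrong tokens, which is what unlocks self-correction.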

Authors:Hyunwoo Yoo
Title: Can Large Language Models Predict Antimicrobial Resistance Gene?
Abstract:
This study demonstrates that generative large language models can be utilized in a more flexible manner for DNA sequence analysis and classification tasks compared to traditional transformer encoder-based models. While recent encoder-based models such as DNABERT and Nucleotide Transformer have shown significant performance in DNA sequence classification, transformer decoder-based generative models have not yet been extensively explored in this field. This study evaluates how effectively generative Large Language Models handle DNA sequences with various labels and analyzes performance changes when additional textual information is provided. Experiments were conducted on antimicrobial resistance genes, and the results show that generative Large Language Models can offer comparable or potentially better predictions, demonstrating flexibility and accuracy when incorporating both sequence and textual information. The code and data used in this work are available at the following GitHub repository: https://github.com/biocomgit/llm4dna.
Chinese: 本研究表明,在DNA序列分析中,生成式大语言模型比传统的编码器模型更具灵活性,结合序列与文本信息时能提供相当甚至更优的预测准确性。
English: This study shows that generative large language models offer greater flexibility and comparable or better accuracy than traditional encoder-based models for DNA sequence analysis, especially when integrating both sequence and textual data.

Authors:Shahar Levy, Nir Mazor, Lihi Shalmon, Michael Hassid, Gabriel Stanovsky
Title: More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG
Abstract:
Retrieval-augmented generation (RAG) provides LLMs with relevant documents. Although previous studies noted that retrieving many documents can degrade performance, they did not isolate how the quantity of documents affects performance while controlling for context length. We evaluate various language models on custom datasets derived from a multi-hop QA task. We keep the context length and position of relevant information constant while varying the number of documents, and find that increasing the document count in RAG settings poses significant challenges for LLMs. Additionally, our results indicate that processing multiple documents is a separate challenge from handling long contexts. We also make the datasets and code available: https://github.com/shaharl6000/MoreDocsSameLen .
中文: 在检索增强生成中,即使上下文长度保持不变,增加文档数量仍对大型语言模型构成显著挑战,且处理多文档与应对长上下文属于不同的难题。
English: Increasing the number of documents in retrieval-augmented generation significantly challenges large language models, even with constant context length, and processing multiple documents presents a distinct difficulty from managing long contexts.
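
A sketch of the controlled construction: the total context stays at a fixed token budget while the number of documents varies. Helper names are hypothetical, and the paper's released datasets additionally fix the position of the relevant information, which this sketch omits.

```python
def build_context(relevant_docs, distractor_pool, n_docs, budget_tokens, tokenizer):
    """Sketch: fixed context length, varying document count."""
    docs = (relevant_docs + distractor_pool)[:n_docs]
    per_doc = budget_tokens // n_docs          # equal share of the fixed budget
    trimmed = []
    for d in docs:
        ids = tokenizer.encode(d)[:per_doc]    # truncate each doc to its share
        trimmed.append(tokenizer.decode(ids))
    return "\n\n".join(trimmed)
```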

Authors:Cheng-Han Chiang, Hung-yi Lee, Michal Lukasik
Title: TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge
Abstract:
The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation, where a numerical assessment is assigned by an LLM to the input text following scoring rubrics. Existing methods for LLM-as-a-judge use cross-entropy (CE) loss for fine-tuning, which neglects the numeric nature of score prediction. Recent work addresses numerical prediction limitations of LLM fine-tuning through regression-aware fine-tuning, which, however, does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method combining CoT reasoning with regression-aware training. TRACT consists of two stages: first, a seed LLM is fine-tuned to generate CoTs, which serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss for learning the CoT reasoning capabilities, and the regression-aware loss for the score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
中文: TRACT方法通过两阶段微调结合思维链推理与回归感知训练,在多项LLM作为评判者的数据集上显著超越了现有方法。
English: TRACT introduces a two-stage fine-tuning method combining chain-of-thought reasoning with regression-aware training, significantly outperforming existing approaches in LLM-as-a-judge evaluations across multiple datasets.
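
A sketch of the combined objective: cross-entropy on the chain-of-thought tokens plus a regression-aware term that treats the expected value over candidate score tokens as the prediction. The `lam` weighting and the exact regression-aware form are assumptions in the spirit of the abstract.

```python
import torch
import torch.nn.functional as F

def tract_style_loss(logits_cot, target_cot, score_logits, score_values,
                     gold_score, lam=1.0):
    """Sketch of a TRACT-style objective.
    logits_cot:   [B, T, V] logits over the CoT tokens
    target_cot:   [B, T]    gold CoT token ids
    score_logits: [B, S]    logits over the S candidate score tokens ("1".."5")
    score_values: [S]       numeric value of each score token
    gold_score:   [B]       gold numeric score
    """
    ce = F.cross_entropy(logits_cot.view(-1, logits_cot.size(-1)),
                         target_cot.view(-1))
    probs = F.softmax(score_logits, dim=-1)   # distribution over score tokens
    pred = (probs * score_values).sum(-1)     # expected score (regression-aware)
    return ce + lam * F.mse_loss(pred, gold_score)
```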

Authors:Yafu Li, Ronghao Zhang, Zhilin Wang, Huajian Zhang, Leyang Cui, Yongjing Yin, Tong Xiao, Yue Zhang
Title: Lost in Literalism: How Supervised Training Shapes Translationese in LLMs
Abstract:
Large language models (LLMs) have achieved remarkable success in machine translation, demonstrating impressive performance across diverse languages. However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge in LLM-based translation systems. Despite their pre-training on vast corpora of natural utterances, LLMs exhibit translationese errors and generate unexpected unnatural translations, stemming from biases introduced during supervised fine-tuning (SFT). In this work, we systematically evaluate the prevalence of translationese in LLM-generated translations and investigate its roots during supervised training. We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances. Empirical evaluations demonstrate that these approaches significantly reduce translationese while improving translation naturalness, validated by human evaluations and automatic metrics. Our findings highlight the need for training-aware adjustments to optimize LLM translation outputs, paving the way for more fluent and target-language-consistent translations. We release the data and code at https://github.com/yafuly/LLM_Translationese.
中文: 大语言模型在监督微调中产生的偏差导致译文生硬不自然,但通过优化参考译文和筛选训练数据等方法,可显著减少翻译腔并提升译文的流畅度。
English: Large language models often produce unnatural, literal translations known as translationese due to biases from supervised fine-tuning, but proposed methods like polishing references and filtering training data effectively reduce these errors and enhance translation naturalness.

Authors:Ziyi Yang, Fanqi Wan, Longguang Zhong, Canbin Huang, Guosheng Liang, Xiaojun Quan
Title: FuseChat-3.0: Preference Optimization Meets Heterogeneous Model Fusion
Abstract:
We introduce FuseChat-3.0, a suite of large language models (LLMs) developed by integrating the strengths of heterogeneous source LLMs into more compact target LLMs. Our source models include the powerful Gemma-2-27B-it, Mistral-Large-Instruct-2407, Qwen-2.5-72B-Instruct, and Llama-3.1-70B-Instruct. For target models, we focus on three widely-used smaller variants (Llama-3.1-8B-Instruct, Gemma-2-9B-it, and Qwen-2.5-7B-Instruct), along with two ultra-compact options, Llama-3.2-3B-Instruct and Llama-3.2-1B-Instruct. To leverage the diverse capabilities of these source models, we develop a specialized data construction protocol tailored to various tasks and domains. The FuseChat-3.0 training pipeline consists of two key stages: (1) supervised fine-tuning (SFT) to align the target and source model distributions, and (2) Direct Preference Optimization (DPO) to apply preferences from multiple source LLMs to fine-tune the target model. The resulting FuseChat-3.0 models exhibit significant performance gains across tasks such as instruction following, general knowledge, mathematics, and coding. As illustrated in Figure 1, using Llama-3.1-8B-Instruct as the target model, our fusion approach achieves an average improvement of 6.8 points across 14 benchmarks. Moreover, it demonstrates remarkable gains of 37.1 points and 30.1 points on the instruction-following benchmarks AlpacaEval-2 and Arena-Hard, respectively. Our code, models, and datasets are available at https://github.com/SLIT-AI/FuseChat-3.0.
中文:FuseChat-3.0通过两阶段训练流程将多个大型源模型的优势融合到更紧凑的目标模型中,在多项基准测试中实现了显著的性能提升。
English: FuseChat-3.0 integrates the strengths of multiple large source models into smaller target models through a two-stage training process, achieving significant performance improvements across various benchmarks.
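
As a concrete reference for stage (2), here is a minimal sketch of the standard DPO objective; the fusion-specific construction of chosen/rejected pairs from multiple source models is not shown, and `beta` is an assumed default.

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective (sketch). Inputs are sequence log-probabilities
    of the chosen (w) and rejected (l) responses under the trainable policy
    and the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()   # push policy toward chosen responses
```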

Authors:Simin Chen, Pranav Pusarla, Baishakhi Ray
Title: Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination
Abstract:
The rapid evolution of code large language models underscores the need for effective and transparent benchmarking of their reasoning capabilities. However, the current benchmarking approach heavily depends on publicly available, human-created datasets. The widespread use of these fixed benchmark datasets makes the benchmarking process static and thus particularly susceptible to data contamination, an unavoidable consequence of the extensive data collection processes used to train Code LLMs. Existing approaches that address data contamination often suffer from human effort limitations and imbalanced problem complexity. To tackle these challenges, we propose \tool, a novel benchmarking suite for evaluating Code LLMs under potential data contamination. Given a seed programming problem, \tool employs multiple agents to extract and modify the context without altering the core logic, generating semantically equivalent variations. We introduce dynamic data generation methods and conduct empirical studies on two seed datasets across 21 Code LLMs. Results show that \tool effectively benchmarks reasoning capabilities under contamination risks while generating diverse problem sets to ensure consistent and reliable evaluations.

Authors:Runtao Zhou, Guangya Wan, Saadia Gabriel, Sheng Li, Alexander J Gates, Maarten Sap, Thomas Hartvigsen
Title: Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in reasoning tasks, leading to their widespread deployment. However, recent studies have highlighted concerning biases in these models, particularly in their handling of dialectal variations like African American English (AAE). In this work, we systematically investigate dialectal disparities in LLM reasoning tasks. We develop an experimental framework comparing LLM performance given Standard American English (SAE) and AAE prompts, combining LLM-based dialect conversion with established linguistic analyses. We find that LLMs consistently produce less accurate responses and simpler reasoning chains and explanations for AAE inputs compared to equivalent SAE questions, with disparities most pronounced in social science and humanities domains. These findings highlight systematic differences in how LLMs process and reason about different language varieties, raising important questions about the development and deployment of these systems in our multilingual and multidialectal world. Our code repository is publicly available at https://github.com/Runtaozhou/dialect_bias_eval.
中文: 大语言模型在处理方言时存在显著偏见,与标准美国英语相比,非裔美国英语的输入导致其推理准确性下降且逻辑链条更为简化,尤其在社会科学和人文领域表现明显。
English: Large Language Models exhibit significant dialectal bias, performing less accurately and with simpler reasoning for African American English compared to Standard American English, particularly in social sciences and humanities.

Authors:Feng Ni, Kui Huang, Yao Lu, Wenyu Lv, Guanzhong Wang, Zeyu Chen, Yi Liu
Title: PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
Abstract:
With the rapid advancement of digitalization, various document images are being applied more extensively in production and daily life, and there is an increasingly urgent need for fast and accurate parsing of the content in document images. Therefore, this report presents PP-DocBee, a novel multimodal large language model designed for end-to-end document image understanding. First, we develop a data synthesis strategy tailored to document scenarios in which we build a diverse dataset to improve the model generalization. Then, we apply a few training techniques, including dynamic proportional sampling, data preprocessing, and OCR postprocessing strategies. Extensive evaluations demonstrate the superior performance of PP-DocBee, achieving state-of-the-art results on English document understanding benchmarks and even outperforming existing open source and commercial models in Chinese document understanding. The source code and pre-trained models are publicly available at \href{https://github.com/PaddlePaddle/PaddleMIX}{https://github.com/PaddlePaddle/PaddleMIX}.
中文: 本报告提出的PP-DocBee多模态大语言模型,通过针对性的数据合成与训练策略,在英文和中文文档图像理解任务中均实现了最优性能。
English: This report introduces PP-DocBee, a multimodal large language model that achieves state-of-the-art performance in both English and Chinese document image understanding through specialized data synthesis and training techniques.

Authors:Wenhui Zhu, Xin Li, Xiwen Chen, Peijie Qiu, Vamsi Krishna Vasa, Xuanzhao Dong, Yanxi Chen, Natasha Lepore, Oana Dumitrascu, Yi Su, Yalin Wang
Title: RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models
Abstract:
Recently, Multimodal Large Language Models (MLLMs) have gained significant attention for their remarkable ability to process and analyze non-textual data, such as images, videos, and audio. Notably, several adaptations of general-domain MLLMs to the medical field have been explored, including LLaVA-Med. However, these medical adaptations remain insufficiently advanced in understanding and interpreting retinal images. In contrast, medical experts emphasize the importance of quantitative analyses for disease detection and interpretation. This underscores a gap between general-domain and medical-domain MLLMs: while general-domain MLLMs excel in broad applications, they lack the specialized knowledge necessary for precise diagnostic and interpretative tasks in the medical field. To address these challenges, we introduce \textit{RetinalGPT}, a multimodal conversational assistant for clinically preferred quantitative analysis of retinal images. Specifically, we achieve this by compiling a large retinal image dataset, developing a novel data pipeline, and employing customized visual instruction tuning to both enhance retinal analysis and enrich medical knowledge. In particular, RetinalGPT outperforms generic-domain MLLMs by a large margin in diagnosing retinal diseases across 8 benchmark retinal datasets. Beyond disease diagnosis, RetinalGPT features quantitative analyses and lesion localization, representing a pioneering step in leveraging LLMs for an interpretable and end-to-end clinical research framework. The code is available at https://github.com/Retinal-Research/RetinalGPT.
中文: 多模态大语言模型在通用领域表现出色,但在医学领域,尤其是视网膜图像解读方面存在不足;为此开发的RetinalGPT通过定量分析、疾病诊断和病灶定位,显著提升了临床应用的准确性和可解释性。
English: Multimodal Large Language Models (MLLMs) are advancing but lack specialized capabilities for precise medical diagnostics, particularly in interpreting retinal images, leading to the development of RetinalGPT, which excels in quantitative analysis, disease diagnosis, and lesion localization for enhanced clinical applications.

Authors:Faiz Surani, Mirac Suzgun, Vyoma Raman, Christopher D. Manning, Peter Henderson, Daniel E. Ho
Title: AI for Scaling Legal Reform: Mapping and Redacting Racial Covenants in Santa Clara County
Abstract:
Legal reform can be challenging in light of the volume, complexity, and interdependence of laws, codes, and records. One salient example of this challenge is the effort to restrict and remove racially restrictive covenants, clauses in property deeds that historically barred individuals of specific races from purchasing homes. Despite the Supreme Court holding such racial covenants unenforceable in 1948, they persist in property records across the United States. Many jurisdictions have moved to identify and strike these provisions, including California, which mandated in 2021 that all counties implement such a process. Yet the scale can be overwhelming, with Santa Clara County (SCC) alone having over 24 million property deed documents, making purely manual review infeasible. We present a novel approach to addressing this pressing issue, developed through a partnership with the SCC Clerk-Recorder's Office. First, we leverage an open large language model, finetuned to detect racial covenants with high precision and recall. We estimate that this system reduces manual efforts by 86,500 person hours and costs less than 2% of the cost for a comparable off-the-shelf closed model. Second, we illustrate the County's integration of this model into responsible operational practice, including legal review and the creation of a historical registry, and release our model to assist the hundreds of jurisdictions engaged in similar efforts. Finally, our results reveal distinct periods of utilization of racial covenants, sharp geographic clustering, and the disproportionate role of a small number of developers in maintaining housing discrimination. We estimate that by 1950, one in four properties across the County were subject to racial covenants.

Authors:Cristian Jimenez-Romero, Alper Yegenoglu, Christian Blum
Title: Multi-Agent Systems Powered by Large Language Models: Applications in Swarm Intelligence
Abstract:
This work examines the integration of large language models (LLMs) into multi-agent simulations by replacing the hard-coded programs of agents with LLM-driven prompts. The proposed approach is showcased in the context of two examples of complex systems from the field of swarm intelligence: ant colony foraging and bird flocking. Central to this study is a toolchain that integrates LLMs with the NetLogo simulation platform, leveraging its Python extension to enable communication with GPT-4o via the OpenAI API. This toolchain facilitates prompt-driven behavior generation, allowing agents to respond adaptively to environmental data. For both example applications mentioned above, we employ both structured, rule-based prompts and autonomous, knowledge-driven prompts. Our work demonstrates how this toolchain enables LLMs to study self-organizing processes and induce emergent behaviors within multi-agent environments, paving the way for new approaches to exploring intelligent systems and modeling swarm intelligence inspired by natural phenomena. We provide the code, including simulation files and data at https://github.com/crjimene/swarm_gpt.
中文: 本研究通过将传统智能体程序替换为LLM驱动提示,将大语言模型集成到多智能体模拟中,以蚁群觅食和鸟群聚集为例,利用NetLogo-GPT-4o工具链生成适应性行为并诱导涌现智能。
English: This study integrates large language models into multi-agent simulations by replacing traditional agent programming with LLM-driven prompts, demonstrated through ant foraging and bird flocking examples using a NetLogo-GPT-4o toolchain to generate adaptive behaviors and emergent intelligence.
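
A sketch of the per-agent LLM call at the core of such a toolchain, using the OpenAI Python SDK that the paper's NetLogo extension communicates with. The prompt wording, percept fields, and action vocabulary are illustrative assumptions; the released repository defines the actual protocol.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ant_decision(percepts: dict) -> str:
    """Sketch of one prompt-driven agent step in an ant-foraging simulation."""
    prompt = (
        "You are an ant foraging for food. Percepts: "
        f"{percepts}. Reply with exactly one action: "
        "move-forward, turn-left, turn-right, pick-up-food, or drop-pheromone."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()

# Example: ant_decision({"food_here": False, "pheromone_ahead": 0.7})
```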

Authors:Xuandong Zhao, Will Cai, Tianneng Shi, David Huang, Licong Lin, Song Mei, Dawn Song
Title: Improving LLM Safety Alignment with Dual-Objective Optimization
Abstract:
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. Direct preference optimization (DPO), a widely deployed alignment method, exhibits limitations in both experimental and theoretical contexts as its loss function proves suboptimal for refusal learning. Through gradient-based analysis, we identify these shortcomings and propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge. This approach significantly increases LLM robustness against a wide range of jailbreak attacks, including prefilling, suffix, and multi-turn attacks across both in-distribution and out-of-distribution scenarios. Furthermore, we introduce a method to emphasize critical refusal tokens by incorporating a reward-based token-level weighting mechanism for refusal learning, which further improves the robustness against adversarial exploits. Our research also suggests that robustness to jailbreak attacks is correlated with token distribution shifts in the training process and internal representations of refusal and harmful tokens, offering valuable directions for future research in LLM safety alignment. The code is available at https://github.com/wicai24/DOOR-Alignment
中文: 本研究揭示了直接偏好优化(DPO)在大语言模型安全对齐中的缺陷,提出通过分离拒绝训练与有害知识遗忘的改进方法,借助词级加权和分布分析显著提升了针对各类越狱攻击的鲁棒性。
English: This study identifies vulnerabilities in Direct Preference Optimization (DPO) for LLM safety alignment and proposes an enhanced method that separates refusal training from harmful knowledge unlearning, significantly boosting robustness against diverse jailbreak attacks through token-level weighting and distribution analysis.
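
A heavily simplified sketch of a two-part objective in the spirit of the abstract: a token-weighted NLL on refusal continuations plus an unlearning term that pushes down the likelihood of harmful continuations. The weighting scheme and the gradient-ascent unlearning form are assumptions, not the paper's exact losses.

```python
import torch.nn.functional as F

def dual_objective_loss(refusal_logits, refusal_targets, token_weights,
                        harmful_logits, harmful_targets, alpha=1.0):
    """Sketch: weighted refusal learning + harmful-knowledge unlearning.
    token_weights up-weights critical refusal tokens (assumed reward-derived)."""
    nll = F.cross_entropy(refusal_logits.view(-1, refusal_logits.size(-1)),
                          refusal_targets.view(-1), reduction="none")
    refusal_term = (token_weights.view(-1) * nll).mean()
    harmful_nll = F.cross_entropy(harmful_logits.view(-1, harmful_logits.size(-1)),
                                  harmful_targets.view(-1))
    # Minimizing -harmful_nll ascends on the harmful likelihood (unlearning).
    return refusal_term - alpha * harmful_nll
```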

Authors:Rui Ye, Shuo Tang, Rui Ge, Yaxin Du, Zhenfei Yin, Siheng Chen, Jing Shao
Title: MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems
Abstract:
LLM-based multi-agent systems (MAS) have shown significant potential in tackling diverse tasks. However, to design effective MAS, existing approaches heavily rely on manual configurations or multiple calls of advanced LLMs, resulting in inadaptability and high inference costs. In this paper, we simplify the process of building an MAS by reframing it as a generative language task, where the input is a user query and the output is a corresponding MAS. To address this novel task, we unify the representation of MAS as executable code and propose a consistency-oriented data construction pipeline to create a high-quality dataset comprising coherent and consistent query-MAS pairs. Using this dataset, we train MAS-GPT, an open-source medium-sized LLM that is capable of generating query-adaptive MAS within a single LLM inference. The generated MAS can be seamlessly applied to process user queries and deliver high-quality responses. Extensive experiments on 9 benchmarks and 5 LLMs show that the proposed MAS-GPT consistently outperforms 10+ baseline MAS methods on diverse settings, indicating MAS-GPT's high effectiveness, efficiency and strong generalization ability. Code will be available at https://github.com/rui-ye/MAS-GPT.
中文: 本文提出MAS-GPT方法,通过将多智能体系统构建重构为生成式任务并采用可执行代码表示,实现了高效生成查询自适应系统,在多个基准测试中展现出优越性能与泛化能力。
English: This paper introduces MAS-GPT, a streamlined method that reframes multi-agent system design as a generative task using executable code representations, enabling efficient and adaptive generation of query-specific systems with superior performance across benchmarks.

Authors:Bar Karov, Dor Zohar, Yam Marcovitz
Title: Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models
Abstract:
We present Attentive Reasoning Queries (ARQs), a novel structured reasoning approach that significantly improves instruction-following in Large Language Models through domain-specialized reasoning blueprints. While LLMs demonstrate remarkable capabilities across diverse tasks, they often fail to maintain adherence to complex, use-case-specific instructions during multi-turn conversations, presenting challenges for business-critical applications. ARQs address this limitation by guiding LLMs through systematic reasoning steps with targeted queries that reinstate critical instructions and facilitate intermediate reasoning throughout the completion process. In extensive testing within Parlant, our framework for reliable customer-facing agents in which ARQs were born out of necessity, they achieved a 90.2% success rate across 87 test scenarios, outperforming both Chain-of-Thought reasoning (86.1%) and direct response generation (81.5%). ARQs showed particular strength in addressing persistent failure modes like guideline re-application and hallucination prevention. Our analysis also revealed that ARQs can potentially be more computationally efficient than free-form reasoning when carefully designed. These findings demonstrate that structured reasoning approaches provide effective mechanisms for controlling how LLMs process information and make decisions in complex scenarios.
中文摘要:ARQs是一种新颖的结构化推理方法,通过专业推理蓝图指导大语言模型执行系统性推理步骤,在87个测试场景中达成90.2%的成功率,显著提升了复杂指令遵循能力并有效遏制幻觉现象。
English Summary: ARQs are a structured reasoning method that enhances LLMs' instruction adherence in multi-turn conversations by using specialized reasoning blueprints, achieving a 90.2% success rate in tests and outperforming traditional approaches.

Authors:Haoran Fan, Bin Li, Yixuan Weng, Shoujun Zhou
Title: Small but Mighty: Enhancing Time Series Forecasting with Lightweight LLMs
Abstract:
While LLMs have demonstrated remarkable potential in time series forecasting, their practical deployment remains constrained by excessive computational demands and memory footprints. Existing LLM-based approaches typically suffer from three critical limitations: inefficient parameter utilization in handling numerical time series patterns; modality misalignment between continuous temporal signals and discrete text embeddings; and inflexibility for real-time expert knowledge integration. We present SMETimes, the first systematic investigation of sub-3B parameter SLMs for efficient and accurate time series forecasting. Our approach centers on three key innovations: a statistically-enhanced prompting mechanism that bridges numerical time series with textual semantics through descriptive statistical features; an adaptive fusion embedding architecture that aligns temporal patterns with language model token spaces through learnable parameters; and a dynamic mixture-of-experts framework enabled by SLMs' computational efficiency, adaptively combining base predictions with domain-specific models. Extensive evaluations across seven benchmark datasets demonstrate that our 3B-parameter SLM achieves state-of-the-art performance on five primary datasets while maintaining 3.8x faster training and 5.2x lower memory consumption compared to 7B-parameter LLM baselines. Notably, the proposed model exhibits better learning capabilities, achieving 12.3% lower MSE than conventional LLM baselines. Ablation studies validate that our statistical prompting and cross-modal fusion modules respectively contribute 15.7% and 18.2% error reduction in long-horizon forecasting tasks. By redefining the efficiency-accuracy trade-off landscape, this work establishes SLMs as viable alternatives to resource-intensive LLMs for practical time series forecasting. Code and models are available at https://github.com/xiyan1234567/SMETimes.
中文: 该摘要介绍了SMETimes,一种参数少于30亿的小型语言模型,通过统计提示和自适应融合等创新技术,解决了大语言模型在时间序列预测中的计算和内存限制问题,在显著降低资源消耗的同时实现了更优的性能。
English: This abstract introduces SMETimes, a sub-3B parameter small language model that overcomes the computational and memory limitations of large language models in time series forecasting through innovations like statistical prompting and adaptive fusion, achieving superior performance with significantly reduced resource usage.
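
A sketch of what statistically-enhanced prompting might look like: describing the numeric series with summary statistics before asking for a forecast. The feature set and wording are illustrative assumptions; the abstract does not specify the exact template.

```python
import numpy as np

def stats_prompt(series: np.ndarray, horizon: int) -> str:
    """Sketch of a statistically-enhanced forecasting prompt."""
    trend = "rising" if series[-1] > series[0] else "falling"
    return (
        f"Series (last 8): {np.round(series[-8:], 3).tolist()}\n"
        f"mean={series.mean():.3f}, std={series.std():.3f}, "
        f"min={series.min():.3f}, max={series.max():.3f}, overall trend: {trend}.\n"
        f"Predict the next {horizon} values as a comma-separated list."
    )
```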

Authors:Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang
Title: PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention
Abstract:
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from incomplete effective context and/or require complex pipeline implementations. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the perspective of the receptive field, recognize the suboptimal nature of existing methods for expanding the receptive field, and introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension through theoretical analysis. PowerAttention achieves exponential receptive field growth in $d$-layer LLMs, allowing each output token to attend to $2^d$ tokens, ensuring completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$, especially on tasks demanding long-range dependencies like Passkey Retrieval and RULER, while maintaining a comparable time complexity to sliding window attention. Efficiency evaluations further highlight PowerAttention's superior speedup in both prefilling and decoding phases compared with dynamic sparse attention methods and full attention ($3.0\times$ faster on 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
中文总结:PowerAttention是一种新颖的稀疏注意力设计,能够在自回归大语言模型中实现指数级感受野扩展,在保持高计算效率的同时,显著提升了长上下文任务的处理性能。
English Summary: PowerAttention is a novel sparse attention design that enables exponential receptive field growth in autoregressive LLMs, achieving superior performance on long-context tasks while maintaining high computational efficiency.
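
To illustrate how a power-of-two sparse pattern yields exponential receptive-field growth, here is a sketch of a static mask where token i attends to itself and to positions i - 2^k; composed over d layers this reaches on the order of 2^d tokens. The exact pattern in PowerAttention may differ.

```python
import torch

def power_attention_mask(seq_len: int) -> torch.Tensor:
    """Sketch of a power-of-two static sparse attention mask (causal)."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        mask[i, i] = True
        k = 0
        while (1 << k) <= i:
            mask[i, i - (1 << k)] = True   # attend to i-1, i-2, i-4, i-8, ...
            k += 1
    return mask  # True = allowed; convert to an additive -inf mask for attention
```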

Authors:Canaan Yung, Hanxun Huang, Sarah Monazam Erfani, Christopher Leckie
Title: CURVALID: Geometrically-guided Adversarial Prompt Detection
Abstract:
Adversarial prompts capable of jailbreaking large language models (LLMs) and inducing undesirable behaviours pose a significant obstacle to their safe deployment. Current mitigation strategies rely on activating built-in defence mechanisms or fine-tuning the LLMs, but the fundamental distinctions between adversarial and benign prompts are yet to be understood. In this work, we introduce CurvaLID, a novel defense framework that efficiently detects adversarial prompts by leveraging their geometric properties. It is agnostic to the type of LLM, offering a unified detection framework across diverse adversarial prompts and LLM architectures. CurvaLID builds on the geometric analysis of text prompts to uncover their underlying differences. We theoretically extend the concept of curvature via the Whewell equation into an $n$-dimensional word embedding space, enabling us to quantify local geometric properties, including semantic shifts and curvature in the underlying manifolds. Additionally, we employ Local Intrinsic Dimensionality (LID) to capture geometric features of text prompts within adversarial subspaces. Our findings reveal that adversarial prompts differ fundamentally from benign prompts in terms of their geometric characteristics. Our results demonstrate that CurvaLID delivers superior detection and rejection of adversarial queries, paving the way for safer LLM deployment. The source code can be found at https://github.com/Cancanxxx/CurvaLID
中文: CurvaLID是一种新型防御框架,通过分析对抗性提示在词嵌入空间中的独特几何特性来检测它们,为不同模型和攻击类型提供了统一的解决方案,以实现更安全的大型语言模型部署。
English: CurvaLID is a novel defense framework that detects adversarial prompts by analyzing their unique geometric properties in word embedding spaces, providing a unified solution for safer LLM deployment across various models and attack types.
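
One simplified reading of the Whewell-equation construction (curvature as the rate of change of the tangential angle with respect to arc length), discretized over a prompt's word-embedding trajectory. This is an illustrative sketch, not CurvaLID's full pipeline, which also incorporates Local Intrinsic Dimensionality.

```python
import numpy as np

def discrete_curvature(embeddings: np.ndarray) -> np.ndarray:
    """Sketch of discrete Whewell-style curvature for a word trajectory.
    embeddings: [n_words, dim] sequence of word embeddings (n_words >= 3)."""
    seg = np.diff(embeddings, axis=0)                      # consecutive steps
    lens = np.linalg.norm(seg, axis=1) + 1e-12
    unit = seg / lens[:, None]
    cos = np.clip((unit[:-1] * unit[1:]).sum(1), -1.0, 1.0)
    dphi = np.arccos(cos)                                  # tangential angle change
    ds = 0.5 * (lens[:-1] + lens[1:])                      # local arc length
    return dphi / ds                                       # kappa = dphi / ds
```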

Authors:Alessio Galatolo, Zhenbang Dai, Katie Winkle, Meriem Beloucif
Title: Visualising Policy-Reward Interplay to Inform Zeroth-Order Preference Optimisation of Large Language Models
Abstract:
Fine-tuning Large Language Models (LLMs) with first-order methods like back-propagation is computationally intensive. Zeroth-Order (ZO) optimisation uses function evaluations instead of gradients, reducing memory usage, but suffers from slow convergence in high-dimensional models. As a result, ZO research in LLMs has mostly focused on classification, overlooking more complex generative tasks. In this paper, we introduce ZOPrO, a novel ZO algorithm designed for Preference Optimisation in LLMs. We begin by analysing the interplay between policy and reward models during traditional (first-order) Preference Optimisation, uncovering patterns in their relative updates. Guided by these insights, we adapt Simultaneous Perturbation Stochastic Approximation (SPSA) with a targeted sampling strategy to accelerate convergence. Through experiments on summarisation, machine translation, and conversational assistants, we demonstrate that our method consistently enhances reward signals while achieving convergence times comparable to first-order methods. While it falls short of some state-of-the-art methods, our work is the first to apply Zeroth-Order methods to Preference Optimisation in LLMs, going beyond classification tasks and paving the way for a largely unexplored research direction. Code and visualisations are available at https://github.com/alessioGalatolo/VisZOPrO
Chinese Summary: 本文提出了ZOPrO,一种用于大型语言模型偏好优化的新型零阶优化算法,在增强奖励信号的同时实现了与一阶方法相当的收敛速度,并开创了零阶方法在分类任务之外的应用。
English Summary: This paper introduces ZOPrO, a novel zeroth-order optimization algorithm for preference optimization in large language models, which enhances reward signals and achieves competitive convergence times while pioneering the application of zeroth-order methods beyond classification tasks.
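
For reference, a sketch of one step of the SPSA estimator that ZOPrO builds on: all parameters are perturbed with a random Rademacher (+/-1) vector and the gradient is estimated from two function evaluations. ZOPrO's targeted sampling strategy is not reproduced here.

```python
import torch

def spsa_grad_estimate(loss_fn, params: torch.Tensor, eps: float = 1e-3):
    """One SPSA gradient estimate (gradient-free; loss_fn maps params -> scalar).
    Since delta entries are +/-1, dividing by delta equals multiplying by it."""
    delta = (torch.randint(0, 2, params.shape) * 2 - 1).to(params.dtype)
    l_plus = loss_fn(params + eps * delta)
    l_minus = loss_fn(params - eps * delta)
    return (l_plus - l_minus) / (2 * eps) * delta

# Usage sketch: params = params - lr * spsa_grad_estimate(loss_fn, params)
```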

Authors:Jabez Magomere, Emanuele La Malfa, Manuel Tonneau, Ashkan Kazemi, Scott Hale
Title: When Claims Evolve: Evaluating and Enhancing the Robustness of Embedding Models Against Misinformation Edits
Abstract:
Online misinformation remains a critical challenge, and fact-checkers increasingly rely on claim matching systems that use sentence embedding models to retrieve relevant fact-checks. However, as users interact with claims online, they often introduce edits, and it remains unclear whether current embedding models used in retrieval are robust to such edits. To investigate this, we introduce a perturbation framework that generates valid and natural claim variations, enabling us to assess the robustness of a wide range of sentence embedding models in a multi-stage retrieval pipeline and evaluate the effectiveness of various mitigation approaches. Our evaluation reveals that standard embedding models exhibit notable performance drops on edited claims, while LLM-distilled embedding models offer improved robustness at a higher computational cost. Although a strong reranker helps to reduce the performance drop, it cannot fully compensate for first-stage retrieval gaps. To address these retrieval gaps, we evaluate train- and inference-time mitigation approaches, demonstrating that they can improve in-domain robustness by up to 17 percentage points and boost out-of-domain generalization by 10 percentage points. Overall, our findings provide practical improvements to claim-matching systems, enabling more reliable fact-checking of evolving misinformation. Code and data are available at https://github.com/JabezNzomo99/claim-matching-robustness.
中文: 当前用于事实核查的声明匹配系统中的句子嵌入模型对编辑后的声明表现出显著脆弱性,但基于大语言模型提炼的嵌入模型及针对性改进策略能显著提升其鲁棒性和泛化能力。
English: Current sentence embedding models used in claim-matching systems for fact-checking show significant vulnerability to edited claims, but LLM-distilled models and targeted mitigation strategies can substantially enhance robustness and generalization.
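
A sketch of the robustness measurement itself: embed the edited claims, retrieve fact-checks by cosine similarity, and report Recall@k against the gold fact-check of the original claim. The embedding model is pluggable; this shows only the standard metric, not the paper's multi-stage pipeline.

```python
import numpy as np

def recall_at_k(query_embs, fc_embs, gold_ids, k=10):
    """Recall@k for first-stage retrieval (sketch).
    query_embs: [n_queries, d] embeddings of (edited) claims
    fc_embs:    [n_factchecks, d] embeddings of fact-checks
    gold_ids:   gold fact-check index for each query."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = fc_embs / np.linalg.norm(fc_embs, axis=1, keepdims=True)
    sims = q @ d.T                               # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]      # top-k retrieved fact-checks
    hits = [g in row for g, row in zip(gold_ids, topk)]
    return float(np.mean(hits))
```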

Authors:Xi Zhu, Haochen Xue, Ziwei Zhao, Wujiang Xu, Jingyuan Huang, Minghao Guo, Qifan Wang, Kaixiong Zhou, Yongfeng Zhang
Title: LLM as GNN: Graph Vocabulary Learning for Text-Attributed Graph Foundation Models
Abstract:
Text-Attributed Graphs (TAGs), where each node is associated with text descriptions, are ubiquitous in real-world scenarios. They typically exhibit distinctive structure and domain-specific knowledge, motivating the development of a Graph Foundation Model (GFM) that generalizes across diverse graphs and tasks. Despite large efforts to integrate Large Language Models (LLMs) and Graph Neural Networks (GNNs) for TAGs, existing approaches suffer from decoupled architectures with two-stage alignment, limiting their synergistic potential. Even worse, existing methods assign out-of-vocabulary (OOV) tokens to graph nodes, leading to graph-specific semantics, token explosion, and incompatibility with task-oriented prompt templates, which hinders cross-graph and cross-task transferability. To address these challenges, we propose PromptGFM, a versatile GFM for TAGs grounded in graph vocabulary learning. PromptGFM comprises two key components: (1) Graph Understanding Module, which explicitly prompts LLMs to replicate the finest GNN workflow within the text space, facilitating seamless GNN-LLM integration and elegant graph-text alignment; (2) Graph Inference Module, which establishes a language-based graph vocabulary ensuring expressiveness, transferability, and scalability, enabling readable instructions for LLM fine-tuning. Extensive experiments demonstrate our superiority and transferability across diverse graphs and tasks. The code is available at https://github.com/agiresearch/PromptGFM.
中文: PromptGFM是一种基于图词汇学习的通用图基础模型,通过图理解模块和图推理模块无缝融合大语言模型与图神经网络,显著提升了跨图和跨任务的迁移性能。
English: PromptGFM is a versatile Graph Foundation Model that integrates Large Language Models and Graph Neural Networks through graph vocabulary learning to enhance cross-graph and cross-task transferability.
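
One speculative reading of "replicating the GNN workflow within the text space": each hop, every node's description is refreshed by prompting an LLM to summarize it together with its neighbors' current descriptions, mimicking message passing. `graph` maps node to neighbor list and `llm` is a hypothetical prompt-to-string function; the prompt template is an assumption.

```python
def text_message_passing(graph, texts, llm, hops=2):
    """Sketch of LLM-as-GNN message passing over node text descriptions."""
    state = dict(texts)                        # node -> current description
    for _ in range(hops):
        new_state = {}
        for node, nbrs in graph.items():
            nbr_txt = "\n".join(state[n] for n in nbrs)
            new_state[node] = llm(
                f"Node: {state[node]}\nNeighbors:\n{nbr_txt}\n"
                "Summarize this node, incorporating its neighborhood:"
            )
        state = new_state                      # synchronous update per hop
    return state
```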

Authors:Jie He, Tao Wang, Deyi Xiong, Qun Liu
Title: The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation
Abstract:
Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contains a source sentence and two contrastive translations, involving 7 different common sense types. Language models pretrained on large-scale corpora, such as BERT and GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that have an impact on this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy (60.1%) and reasoning consistency (31%). The built commonsense test suite is available at https://github.com/tjunlp-lab/CommonMT.
Chinese: 本文提出了一套评估神经机器翻译常识推理能力的测试集,结果显示其在三种歧义类型上的表现较差,准确率仅为60.1%,一致性为31%。
English: This paper introduces a test suite to assess neural machine translation's commonsense reasoning, revealing its poor performance with only 60.1% accuracy and 31% consistency across three ambiguity types.

Authors:Gabriele Sarti, Vilém Zouhar, Grzegorz Chrupała, Ana Guerberof-Arenas, Malvina Nissim, Arianna Bisazza
Title: QE4PE: Word-level Quality Estimation for Human Post-Editing
Abstract:
Word-level quality estimation (QE) methods aim to detect erroneous spans in machine translations, which can direct and facilitate human post-editing. While the accuracy of word-level QE systems has been assessed extensively, their usability and downstream influence on the speed, quality and editing choices of human post-editing remain understudied. In this study, we investigate the impact of word-level QE on machine translation (MT) post-editing in a realistic setting involving 42 professional post-editors across two translation directions. We compare four error-span highlight modalities, including supervised and uncertainty-based word-level QE methods, for identifying potential errors in the outputs of a state-of-the-art neural MT model. Post-editing effort and productivity are estimated from behavioral logs, while quality improvements are assessed by word- and segment-level human annotation. We find that domain, language and editors' speed are critical factors in determining highlights' effectiveness, with modest differences between human-made and automated QE highlights underlining a gap between accuracy and usability in professional workflows.
中文: 词级质量评估旨在识别机器翻译中的错误以辅助人工后编辑,但其对编辑效率和质量的实用影响尚待深入研究,其中领域和编辑速度等因素比高亮来源更能决定其有效性。
English: Word-level quality estimation aids in identifying machine translation errors to assist human post-editing, yet its practical impact on editing efficiency and quality remains underexplored, with factors like domain and editor speed influencing highlight effectiveness more than the source of the highlights themselves.

Authors:Yizhe Zhang, Navdeep Jaitly
Title: SAGE: Steering Dialog Generation with Future-Aware State-Action Augmentation
Abstract:
Recent advances in large language models have demonstrated impressive capabilities in task-oriented applications, yet building emotionally intelligent chatbots that can engage in natural, strategic conversations remains a challenge. We present a novel approach called SAGE that uses latent variables to control long-horizon behavior in dialogue generation. At the core of our method is the State-Action Chain (SAC), which augments standard language model fine-tuning by introducing latent variables that encapsulate emotional states and conversational strategies between dialogue turns. During inference, these variables are generated before each response, enabling coarse-grained control over dialogue progression while maintaining natural interaction patterns. We also introduce a self-improvement pipeline that leverages dialogue tree search, LLM-based reward modeling, and targeted fine-tuning to optimize conversational trajectories. Our experimental results show that models trained with this approach demonstrate improved performance in emotional intelligence metrics while maintaining strong capabilities on LLM benchmarks. The discrete nature of our latent variables facilitates search-based strategies and provides a foundation for future applications of reinforcement learning to dialogue systems, where learning can occur at the state level rather than the token level. https://github.com/apple/ml-sage-dialog-gen
中文摘要:SAGE框架通过引入潜在变量来控制对话生成中的情感状态和会话策略,在保持自然交互和语言模型基准性能的同时,显著提升了聊天机器人的情感智能表现。
English Summary: The SAGE framework introduces latent variables to control emotional states and conversational strategies in dialogue generation, enabling improved emotional intelligence in chatbots while maintaining natural interaction and strong performance on language model benchmarks.

Authors:Siqi Ouyang, Xi Xu, Lei Li
Title: InfiniSST: Simultaneous Translation of Unbounded Speech with Large Language Model
Abstract:
Simultaneous translation of unbounded streaming speech remains a challenging problem due to the need for effectively processing the history speech context and past translations so that quality and latency, including computation overhead, can be balanced. Most prior works assume pre-segmented speech, limiting their real-world applicability. In this paper, we propose InfiniSST, a novel approach that formulates SST as a multi-turn dialogue task, enabling seamless translation of unbounded speech. We construct translation trajectories and robust segments from MuST-C with multi-latency augmentation during training and develop a key-value (KV) cache management strategy to facilitate efficient inference. Experiments on MuST-C En-Es, En-De, and En-Zh demonstrate that InfiniSST reduces computation-aware latency by 0.5 to 1 second while maintaining the same translation quality compared to baselines. Ablation studies further validate the contributions of our data construction and cache management strategy. We release the code and demo at https://github.com/LeiLiLab/InfiniSST
中文:InfiniSST提出了一种新颖的多轮对话方法用于无边界流式语音同传,通过优化的数据构建和缓存管理策略,在保持翻译质量的同时将延迟降低了0.5-1秒。
English: InfiniSST introduces a novel multi-turn dialogue approach for simultaneous streaming speech translation, reducing latency by 0.5-1 seconds while maintaining quality through optimized data construction and cache management.
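
To make the cache-management idea concrete, here is a generic sketch of one simple KV-cache policy for unbounded streaming: keep a small prefix (e.g., instructions) plus the most recent tokens and drop the middle. InfiniSST's actual strategy is more structured (segment-based trajectories); this only illustrates the general mechanism.

```python
import torch

def trim_kv_cache(k: torch.Tensor, v: torch.Tensor,
                  keep_prefix: int, keep_recent: int):
    """Sketch of prefix+recency KV-cache trimming.
    k, v: [batch, heads, seq, head_dim] cached keys and values."""
    seq = k.size(2)
    if seq <= keep_prefix + keep_recent:
        return k, v                            # nothing to trim yet
    idx = torch.cat([torch.arange(keep_prefix),
                     torch.arange(seq - keep_recent, seq)])
    return k[:, :, idx], v[:, :, idx]          # drop the middle of the stream
```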

Authors:Danqing Zhang, Balaji Rama, Jingyi Ni, Shiying He, Fu Zhao, Kunyu Chen, Arnold Chen, Junyu Cao
Title: LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications
Abstract:
We introduce LiteWebAgent, an open-source suite for VLM-based web agent applications. Our framework addresses a critical gap in the web agent ecosystem with a production-ready solution that combines minimal serverless backend configuration, intuitive user and browser interfaces, and extensible research capabilities in agent planning, memory, and tree search. For the core LiteWebAgent agent framework, we implemented a simple yet effective baseline using recursive function calling, providing decoupled action generation and action grounding. In addition, we integrate advanced research components such as agent planning, agent workflow memory, and tree search in a modular and extensible manner. We then integrate the LiteWebAgent agent framework with frontend and backend as deployed systems in two formats: (1) a production Vercel-based web application, which provides users with an agent-controlled remote browser, (2) a Chrome extension leveraging LiteWebAgent's API to control an existing Chrome browser via CDP (Chrome DevTools Protocol). The LiteWebAgent framework is available at https://github.com/PathOnAI/LiteWebAgent, with deployed frontend at https://lite-web-agent.vercel.app/.
中文:LiteWebAgent是一个基于VLM的开源网页代理框架,提供生产就绪的解决方案,具备简化的服务器配置、直观的用户界面以及可扩展的研究功能,如代理规划和树搜索。
English: LiteWebAgent is an open-source framework for VLM-based web agents that offers a production-ready solution with minimal server setups, user-friendly interfaces, and extensible research features like planning and tree search.

Authors:Xuan Cai, Xuesong Bai, Zhiyong Cui, Danmu Xie, Daocheng Fu, Haiyang Yu, Yilong Ren
Title: Text2Scenario: Text-Driven Scenario Generation for Autonomous Driving Test
Abstract:
Autonomous driving (AD) testing constitutes a critical methodology for assessing performance benchmarks prior to product deployment. The creation of segmented scenarios within a simulated environment is acknowledged as a robust and effective strategy; however, the process of tailoring these scenarios often necessitates laborious and time-consuming manual efforts, thereby hindering the development and implementation of AD technologies. In response to this challenge, we introduce Text2Scenario, a framework that leverages a Large Language Model (LLM) to autonomously generate simulation test scenarios that closely align with user specifications, derived from their natural language inputs. Specifically, an LLM, equipped with a meticulously engineered input prompt scheme, functions as a text parser for test scenario descriptions, extracting from a hierarchically organized scenario repository the components that most accurately reflect the user's preferences. Subsequently, by exploiting the precedence of scenario components, the process involves sequentially matching and linking scenario representations within a Domain Specific Language corpus, ultimately fabricating executable test scenarios. The experimental results demonstrate that such prompt engineering can meticulously extract the nuanced details of scenario elements embedded within various descriptive formats, with the majority of generated scenarios aligning closely with the user's initial expectations, allowing for the efficient and precise evaluation of diverse AD stacks without the labor-intensive need for manual scenario configuration. Project page: https://caixxuan.github.io/Text2Scenario.GitHub.io.

Authors:Siming Huang, Yuliang Xu, Mingmeng Geng, Yao Wan, Dongping Chen
Title: Wikipedia in the Era of LLMs: Evolution and Risks
Abstract:
In this paper, we present a thorough analysis of the impact of Large Language Models (LLMs) on Wikipedia, examining the evolution of Wikipedia through existing data and using simulations to explore potential risks. We begin by analyzing page views and article content to study Wikipedia's recent changes and assess the impact of LLMs. Subsequently, we evaluate how LLMs affect various Natural Language Processing (NLP) tasks related to Wikipedia, including machine translation and retrieval-augmented generation (RAG). Our findings and simulation results reveal that Wikipedia articles have been influenced by LLMs, with an impact of approximately 1%-2% in certain categories. If the machine translation benchmark based on Wikipedia is influenced by LLMs, the scores of the models may become inflated, and the comparative results among models might shift as well. Moreover, the effectiveness of RAG might decrease if the knowledge base becomes polluted by LLM-generated content. While LLMs have not yet fully changed Wikipedia's language and knowledge structures, we believe that our empirical findings signal the need for careful consideration of potential future risks.
中文: 本研究分析了大语言模型对维基百科的影响,显示某些类别受到约1%-2%的影响,并指出因内容污染可能导致基准测试分数虚高及检索增强生成效果下降等风险。
English: This study analyzes how Large Language Models are influencing Wikipedia, showing a 1%-2% impact in some areas and highlighting risks like inflated benchmark scores and reduced effectiveness of retrieval-augmented generation due to potential content pollution.

Authors:Shaina Raza, Mukund Sayeeganesh Chettiar, Matin Yousefabadi, Tahniat Khan, Marcelo Lotif
Title: FairSense-AI: Responsible AI Meets Sustainability
Abstract:
In this paper, we introduce FairSense-AI: a multimodal framework designed to detect and mitigate bias in both text and images. By leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), FairSense-AI uncovers subtle forms of prejudice or stereotyping that can appear in content, providing users with bias scores, explanatory highlights, and automated recommendations for fairness enhancements. In addition, FairSense-AI integrates an AI risk assessment component that aligns with frameworks like the MIT AI Risk Repository and NIST AI Risk Management Framework, enabling structured identification of ethical and safety concerns. The platform is optimized for energy efficiency via techniques such as model pruning and mixed-precision computation, thereby reducing its environmental footprint. Through a series of case studies and applications, we demonstrate how FairSense-AI promotes responsible AI use by addressing both the social dimension of fairness and the pressing need for sustainability in large-scale AI deployments. https://vectorinstitute.github.io/FairSense-AI, https://pypi.org/project/fair-sense-ai/ (Sustainability, Responsible AI, Large Language Models, Vision Language Models, Ethical AI, Green AI)

Authors:Belinda Z. Li, Zifan Carl Guo, Jacob Andreas
Title: (How) Do Language Models Track State?
Abstract:
Transformer language models (LMs) exhibit behaviors -- from storytelling to code generation -- that appear to require tracking the unobserved state of an evolving world. How do they do so? We study state tracking in LMs trained or fine-tuned to compose permutations (i.e., to compute the order of a set of objects after a sequence of swaps). Despite the simple algebraic structure of this problem, many other tasks (e.g., simulation of finite automata and evaluation of boolean expressions) can be reduced to permutation composition, making it a natural model for state tracking in general. We show that LMs consistently learn one of two state tracking mechanisms for this task. The first closely resembles the "associative scan" construction used in recent theoretical work by Liu et al. (2023) and Merrill et al. (2024). The second uses an easy-to-compute feature (permutation parity) to partially prune the space of outputs, then refines this with an associative scan. The two mechanisms exhibit markedly different robustness properties, and we show how to steer LMs toward one or the other with intermediate training tasks that encourage or suppress the heuristics. Our results demonstrate that transformer LMs, whether pretrained or fine-tuned, can learn to implement efficient and interpretable state tracking mechanisms, and the emergence of these mechanisms can be predicted and controlled.
中文: 变换器语言模型能够学习可解释的状态跟踪机制,如关联扫描和启发式剪枝,用于排列组合等任务,且这些机制的出现可通过训练进行预测和控制。
English: Transformer language models learn interpretable state tracking mechanisms, such as associative scans and heuristic-based pruning, for tasks like permutation composition, and their emergence can be predicted and controlled through training.
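To make the permutation-composition task concrete, here is a minimal Python sketch (ours, not the authors' code) of state tracking as a prefix composition of swaps, the computation that the paper's "associative scan" mechanism resembles; the tuple encoding of permutations is an illustrative choice.

```python
from itertools import accumulate

def compose(p, q):
    """Compose two permutations given as tuples: (p o q)[i] = p[q[i]]."""
    return tuple(p[i] for i in q)

# A sequence of swaps over 3 objects, each written as a permutation.
swaps = [(1, 0, 2), (0, 2, 1), (1, 0, 2)]

# State tracking = left-to-right composition of all swaps seen so far;
# itertools.accumulate computes exactly this prefix ("associative scan").
states = list(accumulate(swaps, compose))
print(states[-1])  # object order after all swaps: (2, 1, 0)
```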

Authors:Zicong He, Boxuan Zhang, Lu Cheng
Title: Shakespearean Sparks: The Dance of Hallucination and Creativity in LLMs' Decoding Layers
Abstract:
Large language models (LLMs) are known to hallucinate, a phenomenon often linked to creativity. While previous research has primarily explored this connection through theoretical or qualitative lenses, our work takes a quantitative approach to systematically examine the relationship between hallucination and creativity in LLMs. Given the complex nature of creativity, we propose a narrow definition tailored to LLMs and introduce an evaluation framework, HCL, which quantifies Hallucination and Creativity across different Layers of LLMs during decoding. Our empirical analysis reveals a tradeoff between hallucination and creativity that is consistent across layer depth, model type, and model size. Notably, across different model architectures, we identify a specific layer at each model size that optimally balances this tradeoff. Additionally, the optimal layer tends to appear in the early layers of larger models, and the confidence of the model is also significantly higher at this layer. These findings provide a quantitative perspective that offers new insights into the interplay between LLM creativity and hallucination. The code and data for our experiments are available at https://github.com/ZicongHe2002/HCL-Spark.
中文摘要:本研究通过HCL评估框架对大型语言模型的幻觉与创造力进行定量分析,发现在不同架构和规模的模型中存在特定层能最优平衡二者关系,且较大模型的早期层表现更佳。
English Summary: This study quantitatively analyzes the tradeoff between hallucination and creativity in large language models using the HCL evaluation framework, identifying optimal model layers that balance this relationship across different architectures and sizes.

Authors:Yuzhe Gu, Wenwei Zhang, Chengqi Lyu, Dahua Lin, Kai Chen
Title: Mask-DPO: Generalizable Fine-grained Factuality Alignment of LLMs
Abstract:
Large language models (LLMs) exhibit hallucinations (i.e., unfaithful or nonsensical information) when serving as AI assistants in various domains. Since hallucinations are always mixed with truthful content in LLM responses, previous factuality alignment methods that perform response-level preference learning inevitably introduce noise during training. Therefore, this paper proposes a fine-grained factuality alignment method based on Direct Preference Optimization (DPO), called Mask-DPO. Incorporating sentence-level factuality as mask signals, Mask-DPO learns only from factually correct sentences in the preferred samples and avoids penalizing factual content in the dispreferred samples, which resolves the ambiguity in preference learning. Extensive experimental results demonstrate that Mask-DPO significantly improves the factuality of LLM responses to questions from both in-domain and out-of-domain datasets, even though these questions and their corresponding topics are unseen during training. Trained only on the ANAH train set, Llama3.1-8B-Instruct improves from 49.19% to 77.53% on the ANAH test set, even surpassing Llama3.1-70B-Instruct (53.44%), while its FactScore on the out-of-domain Biography dataset also improves from 30.29% to 39.39%. We further study the generalization property of Mask-DPO using different training sample scaling strategies and find that scaling the number of topics in the dataset is more effective than scaling the number of questions. We offer a hypothesis about what factuality alignment does to LLMs and the implications of this phenomenon, and conduct proof-of-concept experiments to verify it. We hope the method and findings pave the way for future research on scaling factuality alignment.
中文: 本文提出了一种名为Mask-DPO的细粒度事实对齐方法,通过在训练中专注于句子层面的真实性,有效提升大语言模型回答的准确性,显著改善了其在领域内和领域外数据集上的表现。
English: This paper introduces Mask-DPO, a fine-grained factuality alignment method that enhances the accuracy of large language models by focusing on sentence-level factual correctness during training, significantly improving performance on both in-domain and out-of-domain datasets.
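As a rough illustration of the core idea, the following PyTorch sketch applies sentence-level factuality masks to per-token log-probabilities inside a standard DPO loss; the tensor names, shapes, and exact placement of the masks are our assumptions, not the paper's implementation.

```python
import torch.nn.functional as F

def masked_dpo_loss(logp_pol_c, logp_ref_c, mask_c,
                    logp_pol_r, logp_ref_r, mask_r, beta=0.1):
    """DPO loss with sentence-level factuality masks (illustrative).

    logp_*: per-token log-probs, shape (batch, seq_len); masks are 0/1.
    mask_c keeps only factually correct tokens of the chosen response;
    mask_r keeps only incorrect tokens of the rejected response, so
    factual content there is never penalized.
    """
    chosen = ((logp_pol_c - logp_ref_c) * mask_c).sum(-1)
    rejected = ((logp_pol_r - logp_ref_r) * mask_r).sum(-1)
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```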

Authors:Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Title: AlignDistil: Token-Level Language Model Alignment as Adaptive Policy Distillation
Abstract:
In modern large language models (LLMs), LLM alignment is of crucial importance and is typically achieved through methods such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). However, in most existing methods for LLM alignment, all tokens in the response are optimized using a sparse, response-level reward or preference annotation. Ignoring token-level rewards may erroneously punish high-quality tokens or encourage low-quality ones, resulting in suboptimal performance and slow convergence. To address this issue, we propose AlignDistil, an RLHF-equivalent distillation method for token-level reward optimization. Specifically, we introduce the reward learned by DPO into the RLHF objective and theoretically prove the equivalence between this objective and a token-level distillation process, where the teacher distribution linearly combines the logits from the DPO model and a reference model. On this basis, we further bridge the accuracy gap between the reward from the DPO model and the pure reward model by building a contrastive DPO reward with a normal and a reverse DPO model. Moreover, to avoid under- and over-optimization on different tokens, we design a token-adaptive logit extrapolation mechanism to construct an appropriate teacher distribution for each token. Experimental results demonstrate the superiority of our AlignDistil over existing methods and showcase fast convergence due to its token-level distributional reward optimization.
中文摘要:AlignDistil是一种基于令牌级奖励优化的对齐方法,通过将DPO奖励融入RLHF目标并设计自适应对数外推机制,有效提升大语言模型对齐性能并加快收敛速度。
English Summary: AlignDistil is a token-level reward optimization method that enhances LLM alignment by integrating DPO rewards into RLHF objectives, using adaptive token-level distillation to improve performance and accelerate convergence.
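A minimal sketch of the distillation view described in the abstract: the teacher linearly combines DPO-model and reference logits and is distilled into the policy token by token. The constant alpha below stands in for the paper's token-adaptive extrapolation, so treat this as a simplification.

```python
import torch.nn.functional as F

def distill_step(policy_logits, dpo_logits, ref_logits, alpha=1.0):
    """Token-level distillation toward a teacher whose logits linearly
    combine the DPO model and a reference model (illustrative; the
    paper adapts alpha per token instead of using a constant).

    All logits: (batch, seq_len, vocab).
    """
    teacher_logits = ref_logits + alpha * (dpo_logits - ref_logits)
    teacher = F.softmax(teacher_logits, dim=-1)
    log_student = F.log_softmax(policy_logits, dim=-1)
    # KL(teacher || student), averaged over the batch dimension.
    return F.kl_div(log_student, teacher, reduction="batchmean")
```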

Authors:Jie Wu, Haoling Li, Xin Zhang, Xiao Liu, Yangyu Huang, Jianwen Luo, Yizhen Zhang, Zuchao Li, Ruihang Chu, Yujiu Yang, Scarlett Li
Title: Teaching Your Models to Understand Code via Focal Preference Alignment
Abstract:
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated based on test case success rates, with the candidate demonstrating a higher pass rate being labeled as positive and its counterpart with a lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity necessary to capture meaningful error-correction relationships; as a result, the model cannot learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To facilitate this, we introduce the CodeFlow dataset, where samples are iteratively refined until passing tests, with modifications capturing error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, models, and datasets are available at: https://github.com/JieWu02/Target-DPO.
Chinese: Target-DPO提出了一种新的代码大模型偏好对齐框架,通过精确定位错误区域并对其相应标记进行对齐来模拟人类调试过程,从而在代码生成任务中实现显著性能提升并减少错误。
English: Target-DPO introduces a novel preference alignment framework for Code LLMs that mimics human debugging by explicitly locating error regions and aligning corresponding tokens, leading to significant performance gains and fewer errors in code generation tasks.

Authors:Daniil Larionov, Steffen Eger
Title: BatchGEMBA: Token-Efficient Machine Translation Evaluation with Batched Prompting and Prompt Compression
Abstract:
Recent advancements in Large Language Model (LLM)-based Natural Language Generation evaluation have largely focused on single-example prompting, resulting in significant token overhead and computational inefficiencies. In this work, we introduce BatchGEMBA-MQM, a framework that integrates batched prompting with the GEMBA-MQM metric for machine translation evaluation. Our approach aggregates multiple translation examples into a single prompt, reducing token usage by 2-4 times (depending on the batch size) relative to single-example prompting. Furthermore, we propose a batching-aware prompt compression model that achieves an additional token reduction of 13-15% on average while also helping to mitigate batching-induced quality degradation. Evaluations across several LLMs (GPT-4o, GPT-4o-mini, Mistral Small, Phi4, and CommandR7B) and varying batch sizes reveal that while batching generally affects quality negatively (though sometimes not substantially), prompt compression does not degrade quality further and in some cases recovers the loss. For instance, GPT-4o retains over 90% of its baseline performance at a batch size of 4 when compression is applied, compared to a 44.6% drop without compression. We plan to release our code and trained models at https://github.com/NL2G/batchgemba to support future research in this domain.
Chinese: BatchGEMBA-MQM框架通过批量提示和提示压缩技术,将机器翻译评估的令牌使用量减少2-4倍,在GPT-4o模型批量大小为4时仍保持90%以上基准性能,有效缓解了批量处理导致的质量下降问题。
English: BatchGEMBA-MQM introduces batched prompting and prompt compression to significantly reduce token usage and computational costs in LLM-based machine translation evaluation, maintaining over 90% baseline performance with GPT-4o at batch size 4 despite minor quality impacts from batching.
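Batched prompting itself is plain prompt construction; a toy sketch follows, with the template wording ours rather than the actual GEMBA-MQM template.

```python
def build_batched_prompt(pairs, src_lang="English", tgt_lang="German"):
    """Pack several (source, translation) pairs into one evaluation
    prompt instead of issuing one prompt per example."""
    header = (f"Score each {src_lang}->{tgt_lang} translation for "
              "quality. Answer with one line per item.\n")
    items = [f"{i + 1}. Source: {src}\n   Translation: {hyp}"
             for i, (src, hyp) in enumerate(pairs)]
    return header + "\n".join(items)

print(build_batched_prompt([("Hello.", "Hallo."), ("Good night.", "Gute Nacht.")]))
```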

Authors:Pengwei Tang, Yong Liu, Dongjie Zhang, Xing Wu, Debing Zhang
Title: LoRA-Null: Low-Rank Adaptation via Null Space for Large Language Models
Abstract:
Low-Rank Adaptation (LoRA) is the leading parameter-efficient fine-tuning method for Large Language Models (LLMs). However, fine-tuned LLMs suffer from catastrophic forgetting of pre-trained world knowledge. To address this issue, inspired by theoretical insights into the null space, we propose LoRA-Null, i.e., Low-Rank Adaptation via null space, which builds adapters initialized from the null space of the pre-trained knowledge activations. Concretely, we randomly collect a few data samples and capture their activations after passing through the LLM layer. We perform Singular Value Decomposition on the input activations to obtain their null space. We use the projection of the pre-trained weights onto the null space as the initialization for the adapters. Experimental results demonstrate that this initialization approach can effectively preserve the original pre-trained world knowledge of the LLMs during fine-tuning. Additionally, if we freeze the values of the down-projection matrices during fine-tuning, it achieves even better preservation of the pre-trained world knowledge. LoRA-Null effectively preserves pre-trained world knowledge while maintaining strong fine-tuning performance, as validated by extensive experiments on the LLaMA series (LLaMA2, LLaMA3, LLaMA3.1, and LLaMA3.2) across Code, Math, and Instruction Following tasks. We also provide a theoretical guarantee for the capacity of LoRA-Null to retain pre-trained knowledge. Code is available at https://github.com/HungerPWAY/LoRA-Null.
中文: LoRA-Null是一种新颖的参数高效微调方法,通过从预训练知识激活的零空间初始化适配器,有效保留原始世界知识,同时在多项任务中保持强大性能。
English: LoRA-Null is a novel parameter-efficient fine-tuning method that initializes adapters from the null space of pre-trained knowledge activations, effectively preserving original world knowledge while maintaining strong performance across various tasks.
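A minimal sketch of the initialization as the abstract describes it: find the (approximate) null space of sampled input activations via SVD, project the pretrained weight onto it, and factor the projection to rank r for the adapter. Variable roles and the tolerance are our assumptions.

```python
import torch

def lora_null_init(W, X, r, tol=1e-5):
    """Illustrative LoRA-Null-style initialization.

    W: pretrained weight (d_out, d_in); X: sampled input activations
    (n, d_in); r: adapter rank. Activation directions with ~zero
    singular values span the null space of the pre-trained knowledge
    activations.
    """
    _, S, Vh = torch.linalg.svd(X, full_matrices=True)   # Vh: (d_in, d_in)
    S_full = torch.zeros(Vh.shape[0], dtype=S.dtype, device=S.device)
    S_full[: S.numel()] = S
    V_null = Vh[S_full < tol]            # rows spanning the null space
    P = V_null.T @ V_null                # projector onto the null space
    W_null = W @ P                       # pretrained weights, projected
    # Factor the projection to rank r to initialize the adapter pair.
    U, Sw, Vwh = torch.linalg.svd(W_null, full_matrices=False)
    B = U[:, :r] * Sw[:r]                # up-projection init, (d_out, r)
    A = Vwh[:r]                          # down-projection init, (r, d_in)
    return A, B
```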

Authors:Caiyu Hu, Yikai Zhang, Tinghui Zhu, Yiwei Ye, Yanghua Xiao
Title: MCiteBench: A Multimodal Benchmark for Generating Text with Citations
Abstract:
Multimodal Large Language Models (MLLMs) have advanced in integrating diverse modalities but frequently suffer from hallucination. A promising solution to mitigate this issue is to generate text with citations, providing a transparent chain for verification. However, existing work primarily focuses on generating citations for text-only content, leaving the challenges of multimodal scenarios largely unexplored. In this paper, we introduce MCiteBench, the first benchmark designed to assess the ability of MLLMs to generate text with citations in multimodal contexts. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. Experimental results reveal that MLLMs struggle to ground their outputs reliably when handling multimodal input. Further analysis uncovers a systematic modality bias and reveals how models internally rely on different sources when generating citations, offering insights into model behavior and guiding future directions for multimodal citation tasks.

Authors:Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang
Title: LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs
Abstract:
Long-context modeling has drawn increasing attention in the area of Large Language Models (LLMs). Continual training with long-context data has become the de-facto method to equip LLMs with the ability to process long inputs. However, measuring the quality of long-context training data remains an open challenge. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which can efficiently identify high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, ensuring a comprehensive quality measurement of long-context data. Experimental results show that our LADM framework significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens for continual training.
中文: LADM框架通过基于注意力的依赖关系测量高效筛选高质量长文本数据,仅用少量训练标记即可显著提升大语言模型在长文本任务中的表现。
English: The LADM framework efficiently selects high-quality long-context data using attention-based dependency measurement, significantly enhancing LLM performance on long-context tasks with minimal training tokens.

Authors:Yujiao Yang, Jing Lian, Linhui Li
Title: Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
Abstract:
Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well-suited for large-scale applications. However, conventional MoE architectures suffer from suboptimal coordination dynamics, where isolated expert operations expose the model to overfitting risks. Moreover, they have not been effectively extended to attention blocks, which limits further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer model into an equivalent group of experts and applies a hierarchical routing mechanism to allocate input subspaces to specialized experts. Our approach advances MoE design with four key innovations: (1) constructing expert groups by partitioning non-MoE models into functionally equivalent specialists; (2) developing a hierarchical routing paradigm that integrates patch-wise data selection and expert selection strategies; (3) extending the MoE design to attention blocks; and (4) proposing a hardware-optimized parallelization scheme that exploits batched matrix multiplications for efficient expert computation. The experiments demonstrate that our UoE model surpasses Full Attention, state-of-the-art MoEs and efficient transformers in several tasks across image and natural language domains. In language modeling tasks, UoE achieves an average reduction of 2.38 in perplexity compared to the best-performing MoE method with only 76% of its FLOPs. In the Long Range Arena benchmark, it demonstrates an average score at least 0.68% higher than all comparison models, with only 50% of the FLOPs of the best MoE method. In image classification, it yields an average accuracy improvement of 1.75% over the best model while maintaining comparable FLOPs. The source codes are available at https://github.com/YujiaoYang-work/UoE.
中文: 提出的专家联盟(UoE)模型通过引入分层路由机制并将专家设计扩展至注意力模块,解决了传统MoE架构的局限性,在语言和图像任务中以更低计算成本实现了更优性能。
English: The proposed Union-of-Experts (UoE) model addresses limitations in conventional MoE architectures by introducing hierarchical routing and extending expert design to attention blocks, achieving superior performance with reduced computational costs across language and image tasks.

Authors:Sohan Patnaik, Milan Aggarwal, Sumit Bhatia, Balaji Krishnamurthy
Title: It Helps to Take a Second Opinion: Teaching Smaller LLMs to Deliberate Mutually via Selective Rationale Optimisation
Abstract:
Very large language models (LLMs) such as GPT-4 have shown the ability to handle complex tasks by generating and self-refining step-by-step rationales. Smaller language models (SLMs), typically with < 13B parameters, have been improved by using the data generated from very-large LMs through knowledge distillation. However, various practical constraints such as API costs, copyright, legal and ethical policies restrict using large (often opaque) models to train smaller models for commercial use. Limited success has been achieved at improving the ability of an SLM to explore the space of possible rationales and evaluate them by itself through self-deliberation. To address this, we propose COALITION, a trainable framework that facilitates interaction between two variants of the same SLM and trains them to generate and refine rationales optimized for the end-task. The variants exhibit different behaviors to produce a set of diverse candidate rationales during the generation and refinement steps. The model is then trained via Selective Rationale Optimization (SRO) to prefer generating rationale candidates that maximize the likelihood of producing the ground-truth answer. During inference, COALITION employs a controller to select the suitable variant for generating and refining the rationales. On five different datasets covering mathematical problems, commonsense reasoning, and natural language inference, COALITION outperforms several baselines by up to 5%. Our ablation studies reveal that cross-communication between the two variants performs better than using the single model to self-refine the rationales. We also demonstrate the applicability of COALITION for LMs of varying scales (4B to 14B parameters) and model families (Mistral, Llama, Qwen, Phi). We release the code for this work at https://github.com/Sohanpatnaik106/coalition.
Chinese: COALITION是一种可训练框架,通过让同一小型语言模型的两个变体交互生成并优化多样化的推理路径,在五个数据集上比基线模型性能提升高达5%。
English: COALITION is a trainable framework that enables two variants of the same small language model to interact, generating and refining diverse rationales optimized for end-tasks, outperforming baselines by up to 5% across five datasets.

Authors:Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yimeng Bai, Wenjie Wang, Hong Cheng, Fuli Feng, Tat-Seng Chua
Title: Measuring What Makes You Unique: Difference-Aware User Modeling for Enhancing LLM Personalization
Abstract:
Personalizing Large Language Models (LLMs) has become a critical step in facilitating their widespread application to enhance individual life experiences. In pursuit of personalization, distilling key preference information from an individual's historical data as instructional preference context to customize LLM generation has emerged as a promising direction. However, these methods face a fundamental limitation by overlooking the inter-user comparative analysis, which is essential for identifying the inter-user differences that truly shape preferences. To address this limitation, we propose Difference-aware Personalization Learning (DPL), a novel approach that emphasizes extracting inter-user differences to enhance LLM personalization. DPL strategically selects representative users for comparison and establishes a structured standard to extract meaningful, task-relevant differences for customizing LLM generation. Extensive experiments on real-world datasets demonstrate that DPL significantly enhances LLM personalization. We release our code at https://github.com/SnowCharmQ/DPL.
中文摘要:本研究提出差异感知个性化学习(DPL)方法,通过分析用户间差异来增强大语言模型的个性化定制,有效解决了现有方法忽略用户对比分析的局限性。
English Summary: The study introduces Difference-aware Personalization Learning (DPL), a method that improves LLM personalization by analyzing inter-user differences, which overcomes the limitation of prior approaches that neglect comparative user analysis.

Authors:Wei Sun, Qianlong Du, Fuwei Cui, Jiajun Zhang
Title: An Efficient and Precise Training Data Construction Framework for Process-supervised Reward Model in Mathematical Reasoning
Abstract:
Enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) is of great scientific and practical significance. Researchers typically employ process-supervised reward models (PRMs) to guide the reasoning process, effectively improving the models' reasoning abilities. However, existing methods for constructing process supervision training data, such as manual annotation and per-step Monte Carlo estimation, are often costly or suffer from poor quality. To address these challenges, this paper introduces a framework called EpicPRM, which annotates each intermediate reasoning step based on its quantified contribution and uses an adaptive binary search algorithm to enhance both annotation precision and efficiency. Using this approach, we efficiently construct a high-quality process supervision training dataset named Epic50k, consisting of 50k annotated intermediate steps. Compared to other publicly available datasets, the PRM trained on Epic50k demonstrates significantly superior performance. Epic50k is available at https://github.com/xiaolizh1/EpicPRM.
中文: EpicPRM提出了一种创新框架,通过量化每个推理步骤的贡献并采用自适应二分搜索算法,高效构建了高质量的Epic50k过程监督训练数据集,相比现有方法显著提升了PRM模型的性能表现。
English: EpicPRM introduces a novel framework that efficiently constructs high-quality process supervision training data by quantifying each reasoning step's contribution and using adaptive binary search, resulting in the Epic50k dataset which significantly enhances PRM performance compared to existing methods.
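The binary-search component can be illustrated as locating the first erroneous step of a reasoning chain with a prefix-correctness oracle (in the paper, estimated from rollouts). The oracle interface and the monotonicity assumption below are ours.

```python
def first_error_step(steps, prefix_ok):
    """Binary-search the first reasoning step whose prefix can no longer
    reach a correct answer (illustrative; assumes prefix_ok is monotone:
    once a prefix fails, every longer prefix fails too).

    prefix_ok(k) -> bool: can steps[:k] still be completed correctly?
    Returns the index of the first erroneous step, or None if the whole
    chain is sound.
    """
    lo, hi = 0, len(steps)        # invariant: prefix_ok(lo) holds
    if prefix_ok(hi):
        return None
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid
        else:
            hi = mid
    return hi - 1                 # steps[hi - 1] introduced the error
```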

Authors:Xinyu Wang, Bohan Zhuang, Qi Wu
Title: Are Large Vision Language Models Good Game Players?
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable abilities in understanding and reasoning about both visual and textual information. However, existing evaluation methods for LVLMs, primarily based on benchmarks like Visual Question Answering and image captioning, often fail to capture the full scope of LVLMs' capabilities. These benchmarks are limited by issues such as inadequate assessment of detailed visual perception, data contamination, and a lack of focus on multi-turn reasoning. To address these challenges, we propose \method{}, a game-based evaluation framework designed to provide a comprehensive assessment of LVLMs' cognitive and reasoning skills in structured environments. \method{} uses a set of games to evaluate LVLMs on four core tasks: Perceiving, Question Answering, Rule Following, and End-to-End Playing, with each target task designed to assess specific abilities, including visual perception, reasoning, decision-making, etc. Based on this framework, we conduct extensive experiments that explore the limitations of current LVLMs, such as handling long structured outputs and perceiving detailed and dense elements. Code and data are publicly available at https://github.com/xinke-wang/LVLM-Playground.
中文: 作者提出了\method{},一种基于游戏的评估框架,通过感知、问答、规则遵循和端到端游戏四个核心任务全面评估大型视觉语言模型的认知与推理能力,解决了现有基准在详细视觉感知评估和多轮推理关注不足等局限性。
English: The authors introduce \method{}, a game-based evaluation framework that comprehensively assesses Large Vision Language Models' cognitive and reasoning abilities through four core tasks, addressing limitations in current benchmarks like inadequate visual perception assessment and lack of multi-turn reasoning focus.

Authors:Yunzhen He, Yusuke Takase, Yoichi Ishibashi, Hidetoshi Shimodaira
Title: DeLTa: A Decoding Strategy based on Logit Trajectory Prediction Improves Factuality and Reasoning Ability
Abstract:
Large Language Models (LLMs) are increasingly being used in real-world applications. However, concerns about the reliability of the content they generate persist, as it frequently deviates from factual correctness or exhibits deficiencies in logical reasoning. This paper proposes a novel decoding strategy aimed at enhancing both factual accuracy and inferential reasoning without requiring any modifications to the architecture or pre-trained parameters of LLMs. Our approach adjusts next-token probabilities by analyzing the trajectory of logits from lower to higher layers in Transformers and applying linear regression. We find that this Decoding by Logit Trajectory-based approach (DeLTa) effectively reinforces factuality and reasoning while mitigating incorrect generation. Experiments on TruthfulQA demonstrate that DeLTa attains up to a 4.9% improvement over the baseline. Furthermore, it enhances performance by up to 8.1% on StrategyQA and 7.3% on GSM8K, both of which demand strong reasoning capabilities.
Chinese: 本文提出DeLTa解码策略,通过基于对数轨迹分析调整令牌概率来增强大语言模型的事实准确性和推理能力,在不改变模型架构的情况下,在推理任务上实现了高达8.1%的性能提升。
English: This paper introduces DeLTa, a novel decoding strategy that enhances the factual accuracy and reasoning capabilities of Large Language Models by adjusting token probabilities based on logit trajectory analysis, achieving improvements of up to 8.1% on reasoning tasks without modifying model architecture.
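A minimal sketch of the decoding idea as we read the abstract: fit a per-vocabulary-entry linear regression over the layer index of intermediate logits and evaluate the fitted trajectory to adjust next-token scores. The extrapolation target is our assumption.

```python
import torch

def delta_logits(layer_logits, target_layer=None):
    """layer_logits: (L, vocab) next-token logits read off successive
    layers, lower to higher. Fits logit[l] ~ a*l + b per vocab entry
    and evaluates the fitted line at target_layer (illustrative)."""
    L, _ = layer_logits.shape
    t = torch.arange(L, dtype=layer_logits.dtype)
    X = torch.stack([t, torch.ones_like(t)], dim=1)       # (L, 2)
    coef = torch.linalg.lstsq(X, layer_logits).solution   # (2, vocab)
    target = float(L - 1 if target_layer is None else target_layer)
    return coef[0] * target + coef[1]                     # (vocab,)
```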

Authors:Xueliang Zhao, Wei Wu, Jian Guan, Lingpeng Kong
Title: PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models
Abstract:
The ability of large language models to solve complex mathematical problems has progressed significantly, particularly for tasks requiring advanced reasoning. However, the scarcity of sufficiently challenging problems, particularly at the Olympiad level, hinders further advancements. In this work, we introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems. The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction, emulating the thought processes of experienced problem designers. We provide a theoretical analysis demonstrating that an optimal rationale should maximize both the likelihood of rationale generation given the associated concepts and the likelihood of problem generation conditioned on both the rationale and the concepts. Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods. Furthermore, we demonstrate that PromptCoT exhibits superior data scalability, consistently maintaining high performance as the dataset size increases, outperforming the baselines. The implementation is available at https://github.com/zhaoxlpku/PromptCoT.
Chinese: PromptCoT提出了一种模拟专家推理自动生成高质量奥林匹克数学题的新方法,在基准测试中优于现有方法,并展现出随数据增长更优的扩展性。
English: PromptCoT introduces a novel method for automatically generating high-quality Olympiad-level math problems by emulating expert reasoning, which outperforms existing approaches on benchmarks and demonstrates superior scalability with increasing data.
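The rationale-selection objective stated in the abstract can be written compactly as follows, with c the concepts, z the rationale, and x the problem (our notation):

```latex
z^{*} = \arg\max_{z} \; \log p(z \mid c) + \log p(x \mid z, c)
```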

Authors:Zirui Wu, Xiao Liu, Jiayi Li, Lingpeng Kong, Yansong Feng
Title: Haste Makes Waste: Evaluating Planning Abilities of LLMs for Efficient and Feasible Multitasking with Time Constraints Between Actions
Abstract:
While Large Language Model-based agents have demonstrated substantial progress in task completion, existing evaluation benchmarks tend to overemphasize single-task performance, with insufficient attention given to the crucial aspects of multitask planning and execution efficiency required in real-world scenarios. To bridge this gap, we present Recipe2Plan, a novel benchmark framework based on real-world cooking scenarios. Unlike conventional benchmarks, Recipe2Plan challenges agents to optimize cooking time through parallel task execution while respecting temporal constraints, i.e., specific actions must be performed within particular time intervals following the preceding steps. Overly aggressive local parallelization may violate this constraint, potentially compromising the entire cooking process. These strict time constraints between actions pose a unique challenge for agents to balance maximizing concurrent operations against adhering to critical timing constraints. Extensive experiments with state-of-the-art models reveal challenges in maintaining this balance between efficiency and feasibility. The results highlight the need for improved temporal awareness and global multitasking capabilities in large language models. We open-source our benchmark and code at https://github.com/WilliamZR/Recipe2Plan.
中文: Recipe2Plan基准通过真实烹饪场景提出创新框架,评估智能体的多任务规划与执行效率,揭示了在并行优化与时间约束间保持平衡的挑战,并强调了大语言模型需提升时间感知能力。
English: The Recipe2Plan benchmark introduces a novel framework based on cooking scenarios to evaluate agents' multitask planning and execution efficiency, revealing challenges in balancing parallelization with temporal constraints and highlighting the need for improved temporal awareness in large language models.

Authors:Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville
Title: Forgetting Transformer: Softmax Attention with a Forget Gate
Abstract:
An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
Chinese: 遗忘变换器(FoX)通过数据依赖性地降低未归一化注意力分数,将遗忘门自然融入变换器中,在长上下文语言建模和短上下文任务中表现卓越,同时与FlashAttention算法兼容且无需位置嵌入。
English: The Forgetting Transformer (FoX) incorporates a forget gate into Transformers by adaptively down-weighting attention scores, achieving superior performance in long-context language modeling and short-context tasks while maintaining compatibility with FlashAttention and eliminating the need for positional embeddings.
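A naive O(T^2) reference implementation of the mechanism as the abstract describes it, where cumulative log forget gates down-weight the unnormalized attention scores; projections, shapes, and the single-head setting are our simplifications (the real model pairs this with FlashAttention).

```python
import torch

def forgetting_attention(q, k, v, f):
    """Naive reference (illustrative). q, k, v: (T, d); f: (T,) forget
    gates in (0, 1), e.g. torch.sigmoid of a learned projection of the
    input. Score(i, j) is down-weighted by the product of the gates
    between positions j and i."""
    T, d = q.shape
    scores = (q @ k.T) / d ** 0.5                 # raw attention logits
    c = torch.cumsum(torch.log(f), dim=0)         # cumulative log-gates
    decay = c[:, None] - c[None, :]               # sum_{t=j+1..i} log f_t
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    scores = (scores + decay).masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```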

Authors:Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Title: Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval
Abstract:
Cross-modal retrieval is gaining increasing efficacy and interest from the research community, thanks to large-scale training, novel architectural and learning designs, and its application in LLMs and multimodal LLMs. In this paper, we move a step forward and design an approach that allows for multimodal queries, composed of both an image and a text, and can search within collections of multimodal documents, where images and text are interleaved. Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones, both at the query and document side. To allow for multi-level and cross-modal understanding and feature extraction, ReT employs a novel Transformer-based recurrent cell that integrates both textual and visual features at different layers, and leverages sigmoidal gates inspired by the classical design of LSTMs. Extensive experiments on M2KR and M-BEIR benchmarks show that ReT achieves state-of-the-art performance across diverse settings. Our source code and trained models are publicly available at https://github.com/aimagelab/ReT.
Chinese: 本文提出ReT模型,采用多模态查询和文档,通过基于Transformer的循环单元实现多层次特征融合,在多个基准测试中取得了领先性能。
English: The paper introduces ReT, a novel cross-modal retrieval model that uses multimodal queries and documents with a Transformer-based recurrent cell for enhanced feature integration, achieving state-of-the-art results on benchmarks.

Authors:Kunlun Zhu, Hongyi Du, Zhaochen Hong, Xiaocheng Yang, Shuyi Guo, Zhe Wang, Zhenhailong Wang, Cheng Qian, Xiangru Tang, Heng Ji, Jiaxuan You
Title: MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities as autonomous agents, yet existing benchmarks either focus on single-agent tasks or are confined to narrow domains, failing to capture the dynamics of multi-agent coordination and competition. In this paper, we introduce MultiAgentBench, a comprehensive benchmark designed to evaluate LLM-based multi-agent systems across diverse, interactive scenarios. Our framework measures not only task completion but also the quality of collaboration and competition using novel, milestone-based key performance indicators. Moreover, we evaluate various coordination protocols (including star, chain, tree, and graph topologies) and innovative strategies such as group discussion and cognitive planning. Notably, gpt-4o-mini reaches the average highest task score, graph structure performs the best among coordination protocols in the research scenario, and cognitive planning improves milestone achievement rates by 3%. Code and datasets are publicly available at https://github.com/MultiagentBench/MARBLE.
中文摘要:该摘要介绍了MultiAgentBench这一新基准,用于评估多智能体大语言模型在多样化交互场景中的表现,衡量协作、竞争及多种协调协议,研究发现GPT-4o-mini获得最高任务分且认知规划使里程碑达成率提升3%。
English Summary: The abstract introduces MultiAgentBench, a new benchmark for evaluating multi-agent LLM systems across diverse interactive scenarios, measuring collaboration, competition, and various coordination protocols, with findings showing GPT-4o-mini achieving the highest task scores and cognitive planning boosting milestone completion by 3%.

Authors:Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
Title: EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
Abstract:
The sequential nature of modern LLMs makes them expensive and slow, and speculative sampling has proven to be an effective solution to this problem. Methods like EAGLE perform autoregression at the feature level, reusing top-layer features from the target model to achieve better results than vanilla speculative sampling. A growing trend in the LLM community is scaling up training data to improve model intelligence without increasing inference costs. However, we observe that scaling up data provides limited improvements for EAGLE. We identify that this limitation arises from EAGLE's feature prediction constraints. In this paper, we introduce EAGLE-3, which abandons feature prediction in favor of direct token prediction and replaces reliance on top-layer features with multi-layer feature fusion via a technique named training-time test. These improvements significantly enhance performance and enable the draft model to fully benefit from scaling up training data. Our experiments include both chat models and reasoning models, evaluated on five tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with about 1.4x improvement over EAGLE-2. In the SGLang framework, EAGLE-3 achieves a 1.38x throughput improvement at a batch size of 64. The code is available at https://github.com/SafeAILab/EAGLE.
中文: EAGLE-3通过用直接标记预测取代特征预测,并采用多层特征融合技术,实现了最高6.5倍的加速比和吞吐量提升,同时能充分利用扩展训练数据的优势。
English: EAGLE-3 enhances speculative sampling by replacing feature prediction with direct token prediction and multi-layer feature fusion, achieving up to 6.5x speedup and improved throughput while benefiting fully from scaled training data.

Authors:Yisen Li, Lingfeng Yang, Wenxuan Shen, Pan Zhou, Yao Wan, Weiwei Lin, Dongping Chen
Title: CrowdSelect: Synthetic Instruction Data Selection with Multi-LLM Wisdom
Abstract:
Distilling advanced Large Language Models' instruction-following capabilities into smaller models using a selected subset has become a mainstream approach in model training. While existing synthetic instruction data selection strategies rely mainly on single-dimensional signals (e.g., reward scores, model perplexity), they fail to capture the complexity of instruction-following across diverse fields. Therefore, we investigate more diverse signals to capture comprehensive instruction-response pair characteristics and propose three foundational metrics that leverage Multi-LLM wisdom, informed by (1) diverse LLM responses and (2) reward model assessment. Building upon these base metrics, we propose CrowdSelect, an integrated metric incorporating a clustering-based approach to maintain response diversity. Our comprehensive experiments demonstrate that our foundation metrics consistently improve performance across 4 base models on MT-bench and Arena-Hard. CrowdSelect, efficiently incorporating all metrics, achieves state-of-the-art performance in both Full and LoRA fine-tuning, showing improvements of 4.81% on Arena-Hard and 11.1% on MT-bench with Llama-3.2-3b-instruct. We hope our findings will bring valuable insights for future research in this direction. Code is available at https://github.com/listentm/crowdselect.
Chinese: 本研究提出CrowdSelect方法,通过整合多维指标和聚类技术来优化小型语言模型的指令跟随能力,在MT-bench和Arena-Hard基准测试中实现了最先进的性能表现。
English: This study introduces CrowdSelect, an innovative data selection method that enhances smaller language models by integrating multi-dimensional metrics and clustering to improve instruction-following performance, achieving state-of-the-art results on benchmarks like MT-bench and Arena-Hard.

Authors:Yi-Lin Sung, Prateek Yadav, Jialu Li, Jaehong Yoon, Mohit Bansal
Title: RSQ: Learning from Important Tokens Leads to Better Quantized LLMs
Abstract:
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining. Previous methods typically quantize the weights of each layer by "uniformly" optimizing the layer reconstruction loss across all output tokens. However, in this paper, we demonstrate that better-quantized models can be obtained by prioritizing learning from important tokens (e.g. which have large attention scores). Building on this finding, we propose RSQ (Rotate, Scale, then Quantize), which (1) applies rotations (orthogonal transformation) to the model to mitigate outliers (those with exceptionally large magnitude), (2) scales the token feature based on its importance, and (3) quantizes the model using the GPTQ framework with the second-order statistics computed by scaled tokens. To compute token importance, we explore both heuristic and dynamic strategies. Based on a thorough analysis of all approaches, we adopt attention concentration, which uses attention scores of each token as its importance, as the best approach. We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families: LLaMA3, Mistral, and Qwen2.5. Additionally, models quantized with RSQ achieve superior performance on long-context tasks, further highlighting its effectiveness. Lastly, RSQ demonstrates generalizability across various setups, including different model sizes, calibration datasets, bit precisions, and quantization methods.
中文: RSQ通过旋转和缩放优先处理重要标记来优化模型量化,在多项任务和模型中表现优于基线方法,具备卓越的长上下文处理能力和广泛的适用性。
English: RSQ enhances model quantization by prioritizing important tokens through rotation and scaling, outperforming baselines across multiple tasks and models with superior long-context performance and broad generalizability.
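The "scale, then quantize" step can be sketched as weighting calibration tokens by importance before computing the second-order statistics a GPTQ-style solver consumes; the importance vector (e.g., attention concentration scores) is taken as given here, and the quantization step itself is omitted.

```python
import torch

def weighted_second_order(X, importance):
    """X: (n_tokens, d_in) calibration activations for one layer;
    importance: (n_tokens,) nonnegative token weights, e.g. attention
    concentration scores. Returns H = X^T diag(w) X, the importance-
    weighted statistic a GPTQ-style solver would consume."""
    w = importance / importance.sum()     # normalize token weights
    Xw = X * w.sqrt().unsqueeze(1)        # scale each token by sqrt(w)
    return Xw.T @ Xw                      # equals X^T diag(w) X
```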

Authors:Hamish Ivison, Muru Zhang, Faeze Brahman, Pang Wei Koh, Pradeep Dasigi
Title: Large-Scale Data Selection for Instruction Tuning
Abstract:
Selecting high-quality training data from a larger pool is a crucial step when instruction-tuning language models, as carefully curated datasets often produce models that outperform those trained on much larger, noisier datasets. Automated data selection approaches for instruction-tuning are typically tested by selecting small datasets (roughly 10k samples) from small pools (100-200k samples). However, popular deployed instruction-tuned models often train on hundreds of thousands to millions of samples, subsampled from even larger data pools. We present a systematic study of how well data selection methods scale to these settings, selecting up to 2.5M samples from pools of up to 5.8M samples and evaluating across 7 diverse tasks. We show that many recently proposed methods fall short of random selection in this setting (while using more compute), and even decline in performance when given access to larger pools of data to select over. However, we find that a variant of representation-based data selection (RDS+), which uses weighted mean pooling of pretrained LM hidden states, consistently outperforms more complex methods across all settings tested -- all whilst being more compute-efficient. Our findings highlight that the scaling properties of proposed automated selection methods should be more closely examined. We release our code, data, and models at https://github.com/hamishivi/automated-instruction-selection.
Chinese: 研究表明,尽管许多自动数据选择方法在扩展到数百万样本时表现不如随机选择甚至性能下降,但一种基于表征的高效计算方法(RDS+)在多样化任务中始终表现优异。
English: This study demonstrates that while many automated data selection methods fail to outperform random selection and even degrade with larger data pools, a compute-efficient representation-based approach (RDS+) consistently excels across diverse tasks when scaling to millions of samples.
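A minimal sketch of representation-based selection in the RDS+ spirit: embed each candidate by weighted mean pooling of LM hidden states, then rank the pool by cosine similarity to target-task embeddings. The pooling weights and the max-similarity aggregation are our assumptions.

```python
import torch
import torch.nn.functional as F

def pool_embedding(hidden, weights):
    """hidden: (T, d) LM hidden states for one example; weights: (T,)
    pooling weights. Returns a unit-norm weighted-mean embedding."""
    e = (hidden * weights.unsqueeze(1)).sum(0) / weights.sum()
    return F.normalize(e, dim=0)

def rank_pool(pool_embs, task_embs):
    """pool_embs: (N, d), task_embs: (M, d), both unit-norm. Scores each
    candidate by its best cosine similarity to any target-task example
    and returns candidate indices sorted best-first."""
    sims = pool_embs @ task_embs.T
    return sims.max(dim=1).values.argsort(descending=True)
```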

Authors:Shiqi Chen, Tongyao Zhu, Ruochen Zhou, Jinghan Zhang, Siyang Gao, Juan Carlos Niebles, Mor Geva, Junxian He, Jiajun Wu, Manling Li
Title: Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas
Abstract:
Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge from the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing attention distribution over the image throughout intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, particularly differing between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS, which uses inference-time confidence scores to sharpen the attention on highly relevant regions when confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50-point absolute improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at https://github.com/shiqichen17/AdaptVis.
中文摘要:大型视觉语言模型在空间推理任务中表现不佳,而无需训练的ADAPTVIS方法基于置信度动态调整注意力机制,在空间推理基准测试中实现了高达50个百分点的绝对性能提升。
English Summary: Large Vision Language Models face significant challenges in spatial reasoning, but a new training-free method called ADAPTVIS dynamically adjusts attention based on confidence, achieving up to a 50-point absolute improvement on spatial reasoning benchmarks.
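The decoding rule can be sketched as a confidence-gated temperature on attention logits: sharpen when confident, smooth otherwise. The threshold and scaling constants below are placeholders, not the paper's settings.

```python
import torch

def adaptvis_attention(attn_logits, confidence, thresh=0.5,
                       alpha_sharp=1.5, alpha_smooth=0.7):
    """attn_logits: (..., q_len, kv_len); confidence: scalar in [0, 1],
    e.g. the probability of the generated token. Multiplying the logits
    by alpha > 1 sharpens the attention distribution; alpha < 1 broadens
    it to take in wider context."""
    alpha = alpha_sharp if confidence >= thresh else alpha_smooth
    return torch.softmax(alpha * attn_logits, dim=-1)
```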

Authors:Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
Title: Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
Abstract:
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.
中文: 工具学习旨在增强大语言模型使用工具解决实际任务的能力,但现有基准简化了工具检索步骤,因此提出ToolRet基准以评估并提升检索模型在真实场景中的性能。
English: Tool learning enhances large language models with tools for practical tasks, but current benchmarks overlook realistic tool retrieval challenges, prompting the creation of ToolRet to evaluate and improve retrieval models' performance.

Authors:Chenxi Wang, Tianle Gu, Zhongyu Wei, Lang Gao, Zirui Song, Xiuying Chen
Title: Word Form Matters: LLMs' Semantic Reconstruction under Typoglycemia
Abstract:
Human readers can efficiently comprehend scrambled words, a phenomenon known as Typoglycemia, primarily by relying on word form; if word form alone is insufficient, they further utilize contextual cues for interpretation. While advanced large language models (LLMs) exhibit similar abilities, the underlying mechanisms remain unclear. To investigate this, we conduct controlled experiments to analyze the roles of word form and contextual information in semantic reconstruction and examine LLM attention patterns. Specifically, we first propose SemRecScore, a reliable metric to quantify the degree of semantic reconstruction, and validate its effectiveness. Using this metric, we study how word form and contextual information influence LLMs' semantic reconstruction ability, identifying word form as the core factor in this process. Furthermore, we analyze how LLMs utilize word form and find that they rely on specialized attention heads to extract and process word form information, with this mechanism remaining stable across varying levels of word scrambling. This distinction between LLMs' fixed attention patterns primarily focused on word form and human readers' adaptive strategy in balancing word form and contextual information provides insights into enhancing LLM performance by incorporating human-like, context-aware mechanisms.
Chinese: 研究表明,大型语言模型主要依赖固定的注意力模式处理词形进行语义重建,而人类则能灵活结合词形与上下文;通过融入类人的语境感知机制,有望提升模型的性能。
English: This study reveals that while large language models (LLMs) primarily rely on fixed attention patterns focused on word form for semantic reconstruction—unlike humans who adaptively balance word form and context—integrating human-like contextual mechanisms could enhance LLM performance.
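For concreteness, a tiny stimulus generator of the kind such controlled experiments rely on: shuffle the interior letters of each word while keeping the first and last letters fixed (our illustration, not the paper's code).

```python
import random

def typoglycemia(text, seed=0):
    """Scramble the interior letters of each word, keeping the first and
    last letters (and non-alphabetic tokens) intact."""
    rng = random.Random(seed)
    def scramble(word):
        if len(word) <= 3 or not word.isalpha():
            return word
        mid = list(word[1:-1])
        rng.shuffle(mid)
        return word[0] + "".join(mid) + word[-1]
    return " ".join(scramble(w) for w in text.split())

print(typoglycemia("reading scrambled words is surprisingly easy"))
```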

Authors:Yongchao Chen, Yilun Hao, Yang Zhang, Chuchu Fan
Title: Code-as-Symbolic-Planner: Foundation Model-Based Robot Planning via Symbolic Code Generation
Abstract:
Recent works have shown the great potential of Large Language Models (LLMs) in robot task and motion planning (TAMP). Current LLM approaches generate text- or code-based reasoning chains with sub-goals and action plans. However, they do not fully leverage LLMs' symbolic computing and code generation capabilities. Many robot TAMP tasks involve complex optimization under multiple constraints, where pure textual reasoning is insufficient. While augmenting LLMs with predefined solvers and planners improves performance, it lacks generalization across tasks. Given LLMs' growing coding proficiency, we enhance their TAMP capabilities by steering them to generate code as symbolic planners for optimization and constraint verification. Unlike prior work that uses code to interface with robot action modules, we steer LLMs to generate code as solvers, planners, and checkers for TAMP tasks requiring symbolic computing, while still leveraging textual reasoning to incorporate common sense. With a multi-round guidance and answer evolution framework, the proposed Code-as-Symbolic-Planner improves success rates by an average of 24.1% over the best baseline methods across seven typical TAMP tasks and three popular LLMs. Code-as-Symbolic-Planner shows strong effectiveness and generalizability across discrete and continuous environments, 2D/3D simulations and real-world settings, as well as single- and multi-robot tasks with diverse requirements. See our project website https://yongchao98.github.io/Code-Symbol-Planner/ for prompts, videos, and code.

Authors:Eliya Habba, Ofir Arviv, Itay Itzhak, Yotam Perlitz, Elron Bandel, Leshem Choshen, Michal Shmueli-Scheuer, Gabriel Stanovsky
Title: DOVE: A Large-Scale Multi-Dimensional Predictions Dataset Towards Meaningful LLM Evaluation
Abstract:
Recent work found that LLMs are sensitive to a wide range of arbitrary prompt dimensions, including the type of delimiters, answer enumerators, instruction wording, and more. This throws into question popular single-prompt evaluation practices. We present DOVE (Dataset Of Variation Evaluation), a large-scale dataset containing prompt perturbations of various evaluation benchmarks. In contrast to previous work, we examine LLM sensitivity from a holistic perspective, and assess the joint effects of perturbations along various dimensions, resulting in thousands of perturbations per instance. We evaluate several model families against DOVE, leading to several findings, including efficient methods for choosing well-performing prompts, observing that few-shot examples reduce sensitivity, and identifying instances which are inherently hard across all perturbations. DOVE consists of more than 250M prompt perturbations and model outputs, which we make publicly available to spur a community-wide effort toward meaningful, robust, and efficient evaluation. Browse the data, contribute, and more: https://slab-nlp.github.io/DOVE/

Authors:Zhanghao Hu, Hanqi Yan, Qinglin Zhu, Zhenyi Shen, Yulan He, Lin Gui
Title: Beyond Prompting: An Efficient Embedding Framework for Open-Domain Question Answering
Abstract:
Large language models have recently pushed open-domain question answering (ODQA) to new frontiers. However, prevailing retriever-reader pipelines often depend on multiple rounds of prompt-level instructions, leading to high computational overhead, instability, and suboptimal retrieval coverage. In this paper, we propose EmbQA, an embedding-level framework that alleviates these shortcomings by enhancing both the retriever and the reader. Specifically, we refine query representations via lightweight linear layers under an unsupervised contrastive learning objective, thereby reordering retrieved passages to highlight those most likely to contain correct answers. Additionally, we introduce an exploratory embedding that broadens the model's latent semantic space to diversify candidate generation, and employ an entropy-based selection mechanism to choose the most confident answer automatically. Extensive experiments across three open-source LLMs, three retrieval methods, and four ODQA benchmarks demonstrate that EmbQA substantially outperforms recent baselines in both accuracy and efficiency.
中文:EmbQA框架通过优化查询表示和扩展候选答案多样性,显著提升了开放域问答的准确性和效率,超越了现有方法。
English: The EmbQA framework enhances open domain question answering by improving query representations and diversifying candidate generation, significantly boosting both accuracy and efficiency over existing methods.
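The entropy-based selection mechanism can be sketched as scoring each candidate answer by the mean entropy of its token distributions and keeping the lowest-entropy (most confident) one; the shapes are our assumptions.

```python
import torch

def select_by_entropy(candidate_logits):
    """candidate_logits: list of (T_i, vocab) logits, one per candidate
    answer. Returns the index of the candidate whose token distributions
    have the lowest mean entropy, i.e. the most confident generation."""
    def mean_entropy(logits):
        p = torch.softmax(logits, dim=-1)
        return -(p * torch.log(p.clamp_min(1e-12))).sum(-1).mean()
    scores = torch.stack([mean_entropy(l) for l in candidate_logits])
    return int(scores.argmin())
```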

Authors:Alexander Baranov, Anna Palatkina, Yulia Makovka, Pavel Braslavski
Title: KoWit-24: A Richly Annotated Dataset of Wordplay in News Headlines
Abstract:
We present KoWit-24, a dataset with fine-grained annotation of wordplay in 2,700 Russian news headlines. KoWit-24 annotations include the presence of wordplay, its type, wordplay anchors, and words/phrases the wordplay refers to. Unlike the majority of existing humor collections of canned jokes, KoWit-24 provides wordplay contexts -- each headline is accompanied by the news lead and summary. The most common type of wordplay in the dataset is the transformation of collocations, idioms, and named entities -- the mechanism that has been underrepresented in previous humor datasets. Our experiments with five LLMs show that there is ample room for improvement in wordplay detection and interpretation tasks. The dataset and evaluation scripts are available at https://github.com/Humor-Research/KoWit-24
中文: KoWit-24发布了包含俄语新闻标题细粒度文字游戏标注的数据集,揭示了惯用语转换的主要类型,并证明大语言模型在文字游戏识别任务中仍有明显不足。
English: KoWit-24 introduces a Russian news headline dataset with detailed wordplay annotations, highlighting prevalent transformations of collocations and idioms while demonstrating significant challenges for LLMs in detection tasks.

Authors:Disen Lan, Weigao Sun, Jiaxi Hu, Jusen Du, Yu Cheng
Title: Liger: Linearizing Large Language Models to Gated Recurrent Structures
Abstract:
Transformers with linear recurrent modeling offer linear-time training and constant-memory inference. Despite their demonstrated efficiency and performance, pretraining such non-standard architectures from scratch remains costly and risky. The linearization of large language models (LLMs) transforms pretrained standard models into linear recurrent structures, enabling more efficient deployment. However, current linearization methods typically introduce additional feature map modules that require extensive fine-tuning and overlook the gating mechanisms used in state-of-the-art linear recurrent models. To address these issues, this paper presents Liger, short for Linearizing LLMs to gated recurrent structures. Liger is a novel approach for converting pretrained LLMs into gated linear recurrent models without adding extra parameters. It repurposes the pretrained key matrix weights to construct diverse gating mechanisms, facilitating the formation of various gated recurrent structures while avoiding the need to train additional components from scratch. Using lightweight fine-tuning with Low-Rank Adaptation (LoRA), Liger restores the performance of the linearized gated recurrent models to match that of the original LLMs. Additionally, we introduce Liger Attention, an intra-layer hybrid attention mechanism, which recovers 93% of the Transformer-based LLM's performance with only 0.02% of the pre-training tokens during the linearization process, achieving competitive results across multiple benchmarks, as validated on models ranging from 1B to 8B parameters. Code is available at https://github.com/OpenSparseLLMs/Linearization.
中文摘要:Liger是一种创新方法,可将预训练大语言模型转化为无需添加参数的线性门控循环结构,通过轻量微调保持模型性能,并在多项基准测试中取得优异表现。
English Summary: Liger is a novel method that converts pretrained LLMs into efficient gated linear recurrent models without adding parameters, using lightweight fine-tuning to maintain performance while enabling competitive results across benchmarks.
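The gating idea can be pictured with a toy gated linear recurrence; this is a sketch of the general structure Liger targets, not its exact parameterization, with the gate taken as a sigmoid of the key projection to echo the repurposed key weights:

```python
import torch

def gated_linear_recurrence(q, k, v, gate):
    # q, k, v, gate: (T, d). The recurrent state S is a constant-size
    # (d, d) matrix: decay it with the gate, add the new value-key outer
    # product, and read it out with the query.
    T, d = q.shape
    S = torch.zeros(d, d)
    outs = []
    for t in range(T):
        S = S * gate[t] + torch.outer(v[t], k[t])  # gate decays each key dim
        outs.append(S @ q[t])
    return torch.stack(outs)

T, d = 6, 8
x = torch.randn(T, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
gate = torch.sigmoid(x @ W_k)  # gate built from the (repurposed) key weights
print(gated_linear_recurrence(q, k, v, gate).shape)  # torch.Size([6, 8])
```

The constant-size state is what yields linear-time training and constant-memory inference relative to standard attention.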

Authors:Huifeng Yin, Yu Zhao, Minghao Wu, Xuanfan Ni, Bo Zeng, Hao Wang, Tianqi Shi, Liangying Shao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Title: Marco-o1 v2: Towards Widening The Distillation Bottleneck for Reasoning Models
Abstract:
Large Reasoning Models (LRMs) such as OpenAI o1 and DeepSeek-R1 have shown remarkable reasoning capabilities by scaling test-time compute and generating long Chain-of-Thought (CoT). Distillation--post-training on LRM-generated data--is a straightforward yet effective method to enhance the reasoning abilities of smaller models, but faces a critical bottleneck: we found that distilled long CoT data poses learning difficulty for small models and leads to the inheritance of biases (i.e., over-thinking) when using Supervised Fine-tuning (SFT) and Reinforcement Learning (RL) methods. To alleviate this bottleneck, we propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search (MCTS). We then exploit a set of CoT-aware approaches, including Thoughts Length Balance, Fine-grained DPO, and Joint Post-training Objective, to enhance SFT and RL on the constructed data. We conduct evaluation on various benchmarks such as math (GSM8K, MATH, AIME), instruction-following (Multi-IF), and planning (Blocksworld); the results demonstrate that our approaches substantially improve the reasoning performance of distilled models compared to standard distilled models by reducing hallucinations in long-form thinking. The project homepage is https://github.com/AIDC-AI/Marco-o1.
中文: 从大型推理模型蒸馏长思维链数据会导致小模型学习困难和偏见继承,通过蒙特卡洛树搜索构建树状推理数据并采用思维链感知训练方法,可显著提升蒸馏模型的推理性能。
English: Distilling long chain-of-thought data from large reasoning models causes learning difficulties and bias inheritance in smaller models, which is mitigated by constructing tree-based reasoning data via Monte Carlo Tree Search and implementing CoT-aware training techniques to significantly enhance reasoning performance.

Authors:Zekun Zhou, Xiaocheng Feng, Lei Huang, Xiachong Feng, Ziyun Song, Ruihan Chen, Liang Zhao, Weitao Ma, Yuxuan Gu, Baoxin Wang, Dayong Wu, Guoping Hu, Ting Liu, Bing Qin
Title: From Hypothesis to Publication: A Comprehensive Survey of AI-Driven Research Support Systems
Abstract:
Research is a fundamental process driving the advancement of human civilization, yet it demands substantial time and effort from researchers. In recent years, the rapid development of artificial intelligence (AI) technologies has inspired researchers to explore how AI can accelerate and enhance research. To monitor relevant advancements, this paper presents a systematic review of the progress in this domain. Specifically, we organize the relevant studies into three main categories: hypothesis formulation, hypothesis validation, and manuscript publication. Hypothesis formulation involves knowledge synthesis and hypothesis generation. Hypothesis validation includes the verification of scientific claims, theorem proving, and experiment validation. Manuscript publication encompasses manuscript writing and the peer review process. Furthermore, we identify and discuss the current challenges faced in these areas, as well as potential future directions for research. Finally, we also offer a comprehensive overview of existing benchmarks and tools across various domains that support the integration of AI into the research process. We hope this paper serves as an introduction for beginners and fosters future research. Resources have been made publicly available at https://github.com/zkzhou126/AI-for-Research.
中文: 本文系统综述了人工智能在假设提出、验证及论文发表等研究环节的加速作用,并探讨了当前挑战与未来方向。
English: This paper systematically reviews how AI accelerates research across hypothesis formulation, validation, and manuscript publication, while addressing current challenges and future directions.

Authors:Jia-Chen Zhang, Yu-Jie Xiong, Chun-Ming Xia, Dong-Hai Zhu, Xi-He Qiu
Title: Parameter-Efficient Fine-Tuning of Large Language Models via Deconvolution in Subspace
Abstract:
Large language models (LLMs) are considered a milestone towards achieving Artificial General Intelligence (AGI). With their advanced emergent capabilities, they adapt to a wide range of specific applications. Fine-tuning LLMs for various downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) is well-known for its parameter efficiency: it can reduce the number of parameters needed to fine-tune LLMs by several orders of magnitude. However, LoRA-based approaches encounter a significant limitation due to the bottleneck imposed by rank-one decomposition. As the parameter count in LLMs increases, even a rank-one decomposition might surpass the number of parameters truly necessary for handling more downstream tasks. In this paper, we propose a new method for Parameter-Efficient Fine-Tuning (PEFT) via deconvolution in subspace, dubbed DCFT. We innovatively use deconvolution to complete details and enhance knowledge in subspace incremental matrices, and dynamically control parameters by adjusting the kernel size, unconstrained by rank-one decomposition. Extensive experiments are conducted to validate the effectiveness of DCFT. Results show that, compared to LoRA, DCFT achieves an 8$\times$ reduction in parameters while still achieving highly impressive performance. Our code is available here: https://github.com/Godz-z/DCFT.
中文: 提出的DCFT方法通过子空间反卷积实现参数高效微调,动态控制参数不受秩约束,相比LoRA减少8倍参数量仍保持优异性能。
English: The proposed DCFT method introduces deconvolution in subspace for parameter-efficient fine-tuning, dynamically controlling parameters without rank constraints and achieving an 8× reduction compared to LoRA while maintaining high performance.
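A toy version of the deconvolution idea: a small trainable seed is upsampled by a transposed convolution into the full incremental weight, so the trainable parameter count is set by the seed and kernel sizes rather than by a low-rank factorization. Shapes and initialization are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class DeconvAdapter(nn.Module):
    def __init__(self, d_out, d_in, kernel=4):
        super().__init__()
        # Trainable seed, kernel-times smaller than Delta-W in each dim.
        self.seed = nn.Parameter(1e-2 * torch.randn(1, 1, d_out // kernel, d_in // kernel))
        # kernel_size == stride => output is exactly kernel-times larger per dim.
        self.deconv = nn.ConvTranspose2d(1, 1, kernel_size=kernel, stride=kernel, bias=False)

    def forward(self, x, W_frozen):
        delta_w = self.deconv(self.seed)[0, 0]  # (d_out, d_in) incremental weight
        return x @ (W_frozen + delta_w).T

adapter = DeconvAdapter(64, 64)
W = torch.randn(64, 64)  # frozen pretrained weight
print(adapter(torch.randn(2, 64), W).shape)          # torch.Size([2, 64])
print(sum(p.numel() for p in adapter.parameters()))  # 272 trainable vs. 4096 full
```

Changing the kernel size changes the seed size and hence the trainable parameter budget, which is the dynamic control the abstract refers to.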

Authors:Tianjie Ju, Yi Hua, Hao Fei, Zhenyu Shao, Yubin Zheng, Haodong Zhao, Mong-Li Lee, Wynne Hsu, Zhuosheng Zhang, Gongshen Liu
Title: Watch Out Your Album! On the Inadvertent Privacy Memorization in Multi-Modal Large Language Models
Abstract:
Multi-Modal Large Language Models (MLLMs) have exhibited remarkable performance on various vision-language tasks such as Visual Question Answering (VQA). Despite accumulating evidence of privacy concerns associated with task-relevant content, it remains unclear whether MLLMs inadvertently memorize private content that is entirely irrelevant to the training tasks. In this paper, we investigate how randomly generated task-irrelevant private content can become spuriously correlated with downstream objectives due to partial mini-batch training dynamics, thus causing inadvertent memorization. Concretely, we randomly embed task-irrelevant watermarks into VQA fine-tuning images at varying probabilities and propose a novel probing framework to determine whether MLLMs have inadvertently encoded such content. Our experiments reveal that MLLMs exhibit notably different training behaviors in partial mini-batch settings with task-irrelevant watermarks embedded. Furthermore, through layer-wise probing, we demonstrate that MLLMs trigger distinct representational patterns when encountering previously seen task-irrelevant knowledge, even if this knowledge does not influence their output during prompting. Our code is available at https://github.com/illusionhi/ProbingPrivacy.
中文: 研究表明多模态大语言模型在部分小批量训练中会通过虚假关联无意记忆与任务无关的私有内容,水印实验显示即使这些内容不影响输出,模型仍会触发不同的表征模式。
English: This study reveals that multi-modal large language models can inadvertently memorize task-irrelevant private content through spurious correlations in partial mini-batch training, as demonstrated by watermark experiments showing distinct representational patterns even when such content doesn't affect model outputs.

Authors:Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa
Title: Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers
Abstract:
Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM is expected to implicitly select and apply the appropriate strategies, because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS.
中文摘要:OPTS通过引入显式策略选择机制优化大语言模型的提示设计,其中基于汤普森采样的方法在提升EvoPrompt性能方面表现最佳。
English Summary: OPTS introduces explicit strategy selection mechanisms to optimize prompts for large language models, with a Thompson sampling-based approach showing the best performance in enhancing EvoPrompt's effectiveness.
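A minimal Bernoulli Thompson sampling selector of the kind OPTS adds on top of a prompt optimizer; the strategy names and reward probabilities below are made up for illustration:

```python
import random

class ThompsonStrategySelector:
    def __init__(self, strategies):
        # Beta(1, 1) priors over each strategy's success probability.
        self.alpha = {s: 1.0 for s in strategies}
        self.beta = {s: 1.0 for s in strategies}

    def select(self):
        # Sample a success probability per strategy; play the best sample.
        return max(self.alpha, key=lambda s: random.betavariate(self.alpha[s], self.beta[s]))

    def update(self, strategy, improved):
        # Reward: did applying the strategy improve the prompt's score?
        if improved:
            self.alpha[strategy] += 1
        else:
            self.beta[strategy] += 1

selector = ThompsonStrategySelector(["add_persona", "step_by_step", "few_shot"])
true_rates = {"add_persona": 0.3, "step_by_step": 0.6, "few_shot": 0.5}
for _ in range(200):
    s = selector.select()
    selector.update(s, random.random() < true_rates[s])
best = max(selector.alpha, key=lambda s: selector.alpha[s] / (selector.alpha[s] + selector.beta[s]))
print(best)  # most often: step_by_step
```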

Authors:Chen Zhang, Mingxu Tao, Zhiyuan Liao, Yansong Feng
Title: MiLiC-Eval: Benchmarking Multilingual LLMs for China's Minority Languages
Abstract:
Large language models (LLMs) excel in high-resource languages but struggle with low-resource languages (LRLs), particularly those spoken by minority communities in China, such as Tibetan, Uyghur, Kazakh, and Mongolian. To systematically track the progress in these languages, we introduce MiLiC-Eval, a benchmark designed for minority languages in China, featuring 24K instances across 9 tasks. MiLiC-Eval focuses on underrepresented writing systems. Its parallelism between tasks and languages can provide a faithful and fine-grained assessment of linguistic and problem-solving skills. Our evaluation reveals that open-source LLMs perform poorly on syntax-intensive tasks and multi-script languages. We further demonstrate how MiLiC-Eval can help advance LRL research in handling diverse writing systems and understanding the process of language adaptation.
Chinese: MiLiC-Eval 是一个专为中国少数民族语言设计的评估基准,揭示了开源大语言模型在语法密集型任务和多文字语言处理上的不足,同时推动了低资源语言在文字系统适应方面的研究进展。
English: MiLiC-Eval is a benchmark introduced to assess the performance of large language models on underrepresented minority languages in China, revealing their struggles with syntax-intensive tasks and diverse writing systems while aiding research in language adaptation.

Authors:Dien X. Tran, Nam V. Nguyen, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le
Title: SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking
Abstract:
The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97\% strict accuracy on ISE-DSC01 and 80.82\% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed by 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation. The source code is available at: https://github.com/DAVID-NGUYEN-S16/SemViQA.
中文:SemViQA提出了一种创新的越南语事实核查框架,融合了语义证据检索和两步裁决分类,在实现顶尖准确率的同时显著提升推理速度,为打击虚假信息树立了新标杆。
English: SemViQA introduces a novel Vietnamese fact-checking framework that combines Semantic-based Evidence Retrieval and Two-step Verdict Classification, achieving state-of-the-art accuracy and improved inference speed to combat misinformation effectively.

Authors:Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q. Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter H. F. Ng, Qing Li
Title: HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning
Abstract:
Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures (e.g., graphs), overlooking the hierarchical relationships within them. Hierarchical structure reasoning is crucial for human cognition, particularly in memory organization and problem-solving. It also plays a key role in various real-world tasks, such as information extraction and decision-making. To address this gap, we propose HiBench, the first framework spanning from initial structure generation to final proficiency assessment, designed to benchmark the hierarchical reasoning capabilities of LLMs systematically. HiBench encompasses six representative scenarios, covering both fundamental and practical aspects, and consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries. To evaluate LLMs comprehensively, we develop five capability dimensions that depict different facets of hierarchical structure understanding. Through extensive evaluation of 20 LLMs from 10 model families, we reveal key insights into their capabilities and limitations: 1) existing LLMs show proficiency in basic hierarchical reasoning tasks; 2) they still struggle with more complex structures and implicit hierarchical representations, especially in structural modification and textual reasoning. Based on these findings, we create a small yet well-designed instruction dataset, which enhances LLMs' performance on HiBench by an average of 88.84\% (Llama-3.1-8B) and 31.38\% (Qwen2.5-7B) across all tasks. The HiBench dataset and toolkit are available at https://github.com/jzzzzh/HiBench to encourage further evaluation.
中文: HiBench是一个全面评估大语言模型层次推理能力的框架,通过涵盖六个场景和30项任务弥补现有基准的不足,研究发现LLMs在基础层次任务表现良好,但在复杂结构和隐式表征方面仍有困难,而针对性指令数据能显著提升其性能。
English: HiBench is a comprehensive framework designed to evaluate the hierarchical reasoning capabilities of large language models, addressing the gap in existing benchmarks by covering six scenarios and 30 tasks, and it reveals that while LLMs excel in basic hierarchical tasks, they struggle with complex structures and implicit representations, with performance significantly improved through targeted instruction data.

Authors:Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
Title: Predictive Data Selection: The Data That Predicts Is the Data That Teaches
Abstract:
Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and select pretraining data in an efficient manner. Specifically, we draw inspiration from recent findings showing that compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance when the text domain aligns with the downstream benchmarks (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning, which shares similar intuition with Thrush et al. (2024). To leverage this insight, we introduce predictive data selection (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of the vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu on a scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
Chinese: 本研究提出PreSelect方法,通过轻量级评分器高效筛选预训练数据,仅用300亿标记(即基线3000亿标记的10%)训练模型,在计算量减少90%的同时显著超越基线模型性能。
English: This study introduces PreSelect, an efficient data selection method that uses a lightweight scorer to identify high-quality pretraining data, achieving superior model performance with 10x less compute by training on only 30B tokens instead of 300B.
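The selection step itself is simple once a scorer exists; PreSelect trains a fastText classifier for scoring, while the stand-in below accepts any callable:

```python
def preselect(documents, score_fn, keep_ratio=0.1):
    # Keep the top-scoring fraction of the corpus.
    ranked = sorted(documents, key=score_fn, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

# Hypothetical toy scorer standing in for the trained fastText model.
docs = ["2 + 2 = 4 because ...", "lol", "The derivative of x^2 is 2x.", "click here!!!"]
top = preselect(docs, score_fn=lambda d: len(d) + 5 * any(c.isdigit() for c in d), keep_ratio=0.5)
print(top)
```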

Authors:Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu
Title: DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Abstract:
Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. Nevertheless, the draft model introduces additional computational overhead, becoming a performance bottleneck and increasing the time to first token (TTFT). Previous approaches to mitigate draft model overhead have primarily relied on heuristics and generally failed to match the quality of the draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to 2.61x speedup in generation latency, while reducing TTFT to 83% of that in conventional speculative decoding. The code is available at https://github.com/KaiLv69/DuoDecoding.
中文摘要:DuoDecoding是一种新颖的推测解码方法,通过在CPU和GPU上分别部署草稿模型和目标模型,在保持输出质量的同时实现了最高2.61倍的生成加速。
English Summary: DuoDecoding is a novel speculative decoding method that strategically deploys draft and target models on CPU and GPU respectively, achieving up to 2.61x speedup in generation latency while maintaining output quality.
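The draft-then-verify skeleton that DuoDecoding parallelizes across CPU and GPU looks roughly as follows; all three callables are stand-ins, and the toy "models" operate on characters:

```python
def speculative_decode(draft_next, target_accepts, target_next, prompt, max_new, k=4):
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Draft model cheaply proposes k tokens autoregressively.
        block = []
        for _ in range(k):
            block.append(draft_next(out + block))
        # 2) Target model verifies; keep the longest accepted prefix.
        accepted = []
        for tok in block:
            if not target_accepts(out + accepted, tok):
                break
            accepted.append(tok)
        out += accepted
        if len(accepted) < len(block):
            out.append(target_next(out))  # target supplies the correction
    return out

truth = "abcabcabcabcabc"
target_next = lambda seq: truth[len(seq)]          # toy target model
draft_next = lambda seq: "a"                       # toy (weak) draft model
accepts = lambda seq, tok: tok == target_next(seq)
print("".join(speculative_decode(draft_next, accepts, target_next, "", max_new=9)))
# -> abcabcabc
```

In DuoDecoding the draft runs on the CPU while the target verifies on the GPU, so steps (1) and (2) overlap instead of alternating.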

Authors:Yupu Hao, Pengfei Cao, Zhuoran Jin, Huanxuan Liao, Yubo Chen, Kang Liu, Jun Zhao
Title: Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity
Abstract:
Personalized tool utilization is essential for aligning large language models (LLMs) with user preference in interaction scenarios with various tools. However, most of the current benchmarks primarily focus on either personalization of text generation or direct tool-utilizing, without considering both. In this work, we introduce ETAPP, a novel benchmark for evaluating personalized tool invocation, together with a sandbox environment and a comprehensive dataset of 800 testing cases covering diverse user profiles. To improve the accuracy of our evaluation, we propose a key-point-based LLM evaluation method, mitigating biases in the LLM-as-a-judge system by manually annotating key points for each test case and providing them to the LLM as the reference. Additionally, we evaluate several leading LLMs and provide an in-depth analysis. Furthermore, we investigate the impact of different tool-invoking strategies on LLMs' personalization performance and the effects of fine-tuning in our task. The effectiveness of our preference-setting and key-point-based evaluation method is also validated. Our findings offer insights into improving personalized LLM agents. Our code is available at https://github.com/hypasd-art/ETAPP.
中文: 本文提出ETAPP基准,用于评估大语言模型的个性化工具调用,包含沙盒环境、800个测试用例及基于关键点的评估方法以减少偏差,同时分析工具调用策略和微调效果。
English: This paper introduces ETAPP, a benchmark for evaluating personalized tool invocation in LLMs, featuring a sandbox environment, 800 test cases, and a key-point-based evaluation method to reduce bias, while analyzing tool-invoking strategies and fine-tuning effects.

Authors:Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro
Title: UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
Abstract:
Pre-training and representation learning have been playing an increasingly important role in modern speech processing. Nevertheless, different applications have been relying on different foundation models, since predominant pre-training techniques are either designed for discriminative tasks or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech. We show that with the appropriate design choices for pre-training, one can jointly learn a representation encoder and generative audio decoder that can be applied to both types of tasks. We propose UniWav, an encoder-decoder framework designed to unify pre-training representation learning and generative tasks. On speech recognition, text-to-speech, and speech tokenization, UniWav achieves comparable performance to different existing foundation models, each trained on a specific task. Our findings suggest that a single general-purpose foundation model for speech can be built to replace different foundation models, reducing the overhead and cost of pre-training.
中文: 本研究提出UniWav这一统一的语音预训练编码器-解码器框架,能同时处理判别式和生成式任务,在保持与专业模型相当性能的同时显著降低了预训练成本。
English: This work introduces UniWav, a unified encoder-decoder framework for speech pre-training that effectively handles both discriminative and generative tasks, achieving comparable performance to specialized models while reducing pre-training overhead.

Authors:Jiancheng Zhao, Xingda Yu, Yuxiang Zhang, Zhen Yang
Title: LoR2C : Low-Rank Residual Connection Adaptation for Parameter-Efficient Fine-Tuning
Abstract:
In recent years, pretrained large language models have demonstrated outstanding performance across various natural language processing tasks. However, full-parameter fine-tuning methods require adjusting all model parameters, leading to immense computational resource demands. Although parameter-efficient fine-tuning methods like LoRA have significantly reduced the number of parameters, they still face challenges such as gradient vanishing and the potential for further parameter reduction. To address these issues, this paper proposes a novel parameter-efficient fine-tuning method called LoR2C (Low-Rank Residual Connection Adaptation). LoR2C introduces residual connections with low-rank matrices within the model layers, which not only reduces the number of fine-tuning parameters but also effectively alleviates the gradient vanishing problem. Additionally, this paper presents three optimization variants of LoR2C: ShareLoR2C, MergeLoR2C, and InjectLoR2C. These variants further improve parameter efficiency and model performance through parameter sharing, module merging, and injection mechanisms, respectively. Experimental results on multiple natural language understanding and natural language generation tasks demonstrate that LoR2C and its optimized variants significantly reduce parameter overhead while maintaining or even improving performance, outperforming existing mainstream parameter-efficient fine-tuning methods. Our code is publicly available at https://github.com/Oblivioniss/LoR2C.
中文: 本文提出新型参数高效微调方法LoR2C,通过低秩残差连接减少参数并缓解梯度消失问题,其优化变体在多项自然语言处理任务中显著提升参数效率与模型性能。
English: This paper introduces LoR2C, a novel parameter-efficient fine-tuning method that uses low-rank residual connections to reduce parameters and mitigate gradient vanishing, with optimized variants further enhancing efficiency and performance across NLP tasks.
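A minimal sketch of a low-rank residual connection wrapped around a frozen sub-layer, under the assumption (illustrative, not the paper's exact placement) that the bypass is added to the layer output:

```python
import torch
import torch.nn as nn

class LoR2CWrapper(nn.Module):
    def __init__(self, frozen_layer, d_model, rank=8):
        super().__init__()
        self.layer = frozen_layer
        for p in self.layer.parameters():
            p.requires_grad = False  # base layer stays frozen
        self.A = nn.Parameter(1e-2 * torch.randn(d_model, rank))
        self.B = nn.Parameter(torch.zeros(rank, d_model))  # zero init: bypass starts as a no-op

    def forward(self, x):
        # Low-rank residual path gives gradients a short route around the layer.
        return self.layer(x) + x @ self.A @ self.B

wrapped = LoR2CWrapper(nn.Linear(64, 64), d_model=64, rank=8)
print(wrapped(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
print(sum(p.numel() for p in wrapped.parameters() if p.requires_grad))  # 1024 trainable
```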

Authors:Jeonghoon Shim, Gyuhyeon Seo, Cheongsu Lim, Yohan Jo
Title: ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models
Abstract:
Tool-Augmented Language Models (TALMs) leverage external APIs to answer user queries across various domains. However, existing benchmark datasets for TALM research often feature simplistic dialogues that do not reflect real-world scenarios, such as the need for models to ask clarifying questions or proactively call additional APIs when essential information is missing. To address these limitations, we construct and release ToolDial, a dataset comprising 11,111 multi-turn dialogues, with an average of 8.95 turns per dialogue, based on APIs from RapidAPI. ToolDial has two key characteristics. First, the dialogues incorporate 16 user and system actions (e.g., "Request", "Clarify", "Fail inform") to capture the rich dynamics of real-world interactions. Second, we simulate dialogues where the system requests necessary information from the user based on API documentation and seeks additional APIs if the user fails to provide the required information. To facilitate this process, we introduce a method for generating an API graph that represents input and output compatibility between APIs. Using ToolDial, we evaluate a suite of language models on their ability to predict correct actions and extract input parameter values for API calls from the dialogue history. Modern language models achieve accuracy scores below 70%, indicating substantial room for improvement. We release our dataset and code at https://github.com/holi-lab/ToolDial.
Chinese: ToolDial是一个包含11,111个多轮对话的新数据集,通过模拟真实用户-系统交互和API兼容性图来解决现有工具增强语言模型基准的不足,当前模型在预测正确操作和参数方面的准确率低于70%。
English: ToolDial is a new dataset of 11,111 multi-turn dialogues designed to address the limitations of existing benchmarks for Tool-Augmented Language Models by incorporating realistic user-system interactions and API compatibility graphs, with current models scoring below 70% accuracy in predicting correct actions and parameters.
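The API graph the paper introduces can be sketched as an edge set over input/output field compatibility; the API specs below are simplified stand-ins, not RapidAPI schemas:

```python
from itertools import permutations

def build_api_graph(apis):
    # Edge A -> B whenever some output field of A can fill an input of B.
    edges = []
    for a, b in permutations(apis, 2):
        shared = sorted(set(a["outputs"]) & set(b["inputs"]))
        if shared:
            edges.append((a["name"], b["name"], shared))
    return edges

apis = [
    {"name": "SearchCity",    "inputs": ["query"],   "outputs": ["city_id"]},
    {"name": "GetWeather",    "inputs": ["city_id"], "outputs": ["forecast"]},
    {"name": "GetAirQuality", "inputs": ["city_id"], "outputs": ["aqi"]},
]
for src, dst, fields in build_api_graph(apis):
    print(f"{src} -> {dst} via {fields}")
# SearchCity -> GetWeather via ['city_id']
# SearchCity -> GetAirQuality via ['city_id']
```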

Authors:Boyi Kang, Xinfa Zhu, Zihan Zhang, Zhen Ye, Mingshuai Liu, Ziqian Wang, Yike Zhu, Guobin Ma, Jun Chen, Longshuai Xiao, Chao Weng, Wei Xue, Lei Xie
Title: LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
Abstract:
Recent advancements in language models (LMs) have demonstrated strong capabilities in semantic understanding and contextual modeling, which have spurred progress in generative speech enhancement (SE). However, many LM-based SE approaches primarily focus on semantic information, often neglecting the critical role of acoustic information, which leads to acoustic inconsistency after enhancement and limited generalization across diverse SE tasks. In this paper, we introduce LLaSE-G1, a LLaMA-based language model that incentivizes generalization capabilities for speech enhancement. LLaSE-G1 offers the following key contributions: First, to mitigate acoustic inconsistency, LLaSE-G1 employs continuous representations from WavLM as input and predicts speech tokens from X-Codec2, maximizing acoustic preservation. Second, to promote generalization capability, LLaSE-G1 introduces dual-channel inputs and outputs, unifying multiple SE tasks without requiring task-specific IDs. Third, LLaSE-G1 outperforms prior task-specific discriminative and generative SE models, demonstrating scaling effects at test time and emerging capabilities for unseen SE tasks. Additionally, we release our code and models to support further research in this area.
中文: LLaSE-G1是一种基于LLaMA的语言模型,通过采用WavLM的连续表示和预测X-Codec2标记来解决语音增强中的声学不一致问题,同时通过双通道输入无需任务特定标识即可实现跨任务的泛化能力。
English: LLaSE-G1 is a LLaMA-based language model that addresses acoustic inconsistency in speech enhancement by using continuous representations from WavLM and predicting X-Codec2 tokens, while enabling generalization across multiple tasks through dual-channel inputs without task-specific IDs.

Authors:Zongru Wu, Pengzhou Cheng, Zheng Wu, Tianjie Ju, Zhuosheng Zhang, Gongshen Liu
Title: Smoothing Grounding and Reasoning for MLLM-Powered GUI Agents with Query-Oriented Pivot Tasks
Abstract:
Perception-enhanced pre-training, particularly through grounding techniques, is widely adopted to enhance the performance of graphical user interface (GUI) agents. However, in resource-constrained scenarios, the format discrepancy between coordinate-oriented grounding and action-oriented reasoning limits the effectiveness of grounding for reasoning tasks. To address this challenge, we propose a query-oriented pivot approach called query inference, which serves as a bridge between GUI grounding and reasoning. By inferring potential user queries from a screenshot and its associated element coordinates, query inference improves the understanding of coordinates while aligning more closely with reasoning tasks. Experimental results show that query inference outperforms previous grounding techniques under the same training data scale. Notably, query inference achieves comparable or even better performance to large-scale grounding-enhanced OS-Atlas with less than 0.1% of training data. Furthermore, we explore the impact of reasoning formats and demonstrate that integrating additional semantic information into the input further boosts reasoning performance. The code is publicly available at https://github.com/ZrW00/GUIPivot.
中文摘要:提出的查询推理方法通过从截图和坐标推断用户查询,弥合了GUI基础与推理之间的差距,在极少训练数据下显著超越现有技术。
English Summary: The proposed query inference method bridges the gap between GUI grounding and reasoning by inferring user queries from screenshots and coordinates, significantly outperforming previous techniques with minimal training data.

Authors:Yunfan Gao, Yun Xiong, Wenlong Wu, Zijing Huang, Bohan Li, Haofen Wang
Title: U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack
Abstract:
Recent advancements in Large Language Models (LLMs) have expanded their context windows to unprecedented lengths, sparking debates about the necessity of Retrieval-Augmented Generation (RAG). To address the fragmented evaluation paradigms and limited cases in existing Needle-in-a-Haystack (NIAH) benchmarks, this paper introduces U-NIAH, a unified framework that systematically compares LLMs and RAG methods in controlled long context settings. Our framework extends beyond traditional NIAH by incorporating multi-needle, long-needle, and needle-in-needle configurations, along with different retrieval settings, while leveraging the synthetic Starlight Academy dataset, a fictional magical universe, to eliminate biases from pre-trained knowledge. Through extensive experiments, we investigate three research questions: (1) performance trade-offs between LLMs and RAG, (2) error patterns in RAG, and (3) RAG's limitations in complex settings. Our findings show that RAG significantly enhances smaller LLMs by mitigating the "lost-in-the-middle" effect and improving robustness, achieving an 82.58% win-rate over LLMs. However, we observe that retrieval noise and reverse chunk ordering degrade performance, while surprisingly, advanced reasoning LLMs exhibit reduced RAG compatibility due to sensitivity to semantic distractors. We identify typical error patterns including omission due to noise, hallucination under high-noise conditions, and self-doubt behaviors. Our work not only highlights the complementary roles of RAG and LLMs, but also provides actionable insights for optimizing deployments. Code: https://github.com/Tongji-KGLLM/U-NIAH.
中文:本文提出的U-NIAH框架表明,检索增强生成(RAG)能显著提升小模型鲁棒性且胜率达82.58%,但同时也揭示了检索噪声会导致性能下降,以及高级推理模型存在兼容性减弱的问题。
English: This paper introduces U-NIAH, a unified framework demonstrating that RAG significantly enhances smaller LLMs' robustness with an 82.58% win-rate, while revealing performance degradation from retrieval noise and reduced compatibility in advanced reasoning models.

Authors:Guangsheng Bao, Lihua Rong, Yanbin Zhao, Qiji Zhou, Yue Zhang
Title: Decoupling Content and Expression: Two-Dimensional Detection of AI-Generated Text
Abstract:
The wide usage of LLMs raises critical requirements for detecting AI participation in texts. Existing studies investigate these detections in scattered contexts, leaving a systematic and unified approach unexplored. In this paper, we present HART, a hierarchical framework of AI risk levels, each corresponding to a detection task. To address these tasks, we propose a novel 2D Detection Method, decoupling a text into content and language expression. Our findings show that content is resistant to surface-level changes, which can serve as a key feature for detection. Experiments demonstrate that the 2D method significantly outperforms existing detectors, achieving an AUROC improvement from 0.705 to 0.849 for level-2 detection and from 0.807 to 0.886 for RAID. We release our data and code at https://github.com/baoguangsheng/truth-mirror.
中文: 本文提出HART分层框架,通过将文本解构为内容和语言表达的新型二维检测方法,在AI文本识别中显著超越现有检测器,AUROC最高提升至0.849。
English: This paper introduces HART, a hierarchical framework for detecting AI-generated text through a novel 2D method that analyzes content and language separately, significantly outperforming existing detectors with AUROC improvements up to 0.849.

Authors:Samar M. Magdy, Sang Yun Kwon, Fakhraddin Alwajih, Safaa Abdelfadil, Shady Shehata, Muhammad Abdul-Mageed
Title: Jawaher: A Multidialectal Dataset of Arabic Proverbs for LLM Benchmarking
Abstract:
Recent advancements in instruction fine-tuning, alignment methods such as reinforcement learning from human feedback (RLHF), and optimization techniques like direct preference optimization (DPO) have significantly enhanced the adaptability of large language models (LLMs) to user preferences. However, despite these innovations, many LLMs continue to exhibit biases toward Western, Anglo-centric, or American cultures, with performance on English data consistently surpassing that of other languages. This reveals a persistent cultural gap in LLMs, which complicates their ability to accurately process culturally rich and diverse figurative language such as proverbs. To address this, we introduce Jawaher, a benchmark designed to assess LLMs' capacity to comprehend and interpret Arabic proverbs. Jawaher includes proverbs from various Arabic dialects, along with idiomatic translations and explanations. Through extensive evaluations of both open- and closed-source models, we find that while LLMs can generate idiomatically accurate translations, they struggle with producing culturally nuanced and contextually relevant explanations. These findings highlight the need for ongoing model refinement and dataset expansion to bridge the cultural gap in figurative language processing.
中文:尽管大语言模型在用户偏好适应方面取得进展,但文化偏见依然存在,Jawaher基准测试显示模型对阿拉伯谚语能生成准确翻译,却难以提供文化上细致入微的解释。
English: Recent advancements in LLM alignment have improved user adaptability, yet models still exhibit cultural biases, particularly in processing Arabic proverbs, as revealed by the Jawaher benchmark showing limitations in cultural nuance despite accurate translations.

Authors:K. O. T. Erziev
Title: À la recherche du sens perdu: your favourite LLM might have more to say than you can understand
Abstract:
We report a peculiar observation that LLMs can assign hidden meanings to sequences that seem visually incomprehensible to humans: for example, a nonsensical phrase consisting of Byzantine musical symbols is recognized by gpt-4o as "say abracadabra". Moreover, some models can communicate using these sequences. Some of these meanings are hypothesized to partly originate in the massive spurious correlations due to BPE tokenization. We systematically evaluate the presence of such abilities in a wide range of models: Claude-3.5 Haiku, Claude-3.5 Sonnet (New and Old), Claude-3.7 Sonnet, gpt-4o mini, gpt-4o, o1-mini, Llama-3.3 70B, DeepSeek-R1-Distill-Llama 70B, Qwen2.5 1.5B, Qwen2.5 32B, Phi-3.5 mini, GigaChat-Max, Vikhr-Llama-3.2 1B. We argue that this observation might have far-reaching consequences for both safety and security of the modern and future LLMs and systems that employ them. As an illustration, we show that applying this method in combination with simple templates is sufficient to jailbreak previous generation models, with ASR = 0.4 on gpt-4o mini. Our code and data artifacts are available at https://github.com/L3G5/llm-hidden-meanings.
中文: 大型语言模型能够为视觉上难以理解的序列赋予隐藏含义,这可能源于分词相关性,并引发安全隐患,如其可被用于越狱模型的能力所示。
English: Large language models can assign hidden meanings to visually incomprehensible sequences, potentially due to tokenization correlations, raising safety concerns as demonstrated by their ability to jailbreak models.

Authors:Hanjiang Hu, Alexander Robey, Changliu Liu
Title: Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks
Abstract:
Large language models (LLMs) are shown to be vulnerable to jailbreaking attacks where adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory, ensuring invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to detect and filter harmful queries emerging from evolving contexts proactively. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments under multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering and lightweight LLM guardrails baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness and over-refusal. The project website is available at https://sites.google.com/view/llm-nbf/home, and our code is available at https://github.com/HanjiangHu/NBF-LLM.
中文摘要:该研究提出了一种基于神经屏障函数的安全引导框架,通过主动检测多轮对话中的有害查询并维持持续安全状态,有效防御大语言模型的多轮越狱攻击。
English Summary: The proposed safety steering framework using a neural barrier function effectively prevents multi-turn jailbreaking attacks in large language models by proactively detecting harmful queries and ensuring invariant safety throughout dialogues.
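At the interface level, the barrier function reduces to a learned scalar over the dialogue state and candidate query, with positive values treated as unsafe; the untrained network below is only a stand-in for the safety predictor the paper learns:

```python
import torch
import torch.nn as nn

class NeuralBarrier(nn.Module):
    def __init__(self, d_state=128, d_query=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_state + d_query, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def is_unsafe(self, state_emb, query_emb):
        # barrier(s, q) > 0 means the query would cross the safety boundary.
        return self.head(torch.cat([state_emb, query_emb], dim=-1)).item() > 0.0

nbf = NeuralBarrier()
state, query = torch.randn(128), torch.randn(128)
print("filter" if nbf.is_unsafe(state, query) else "pass through")
```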

Authors:Fakhraddin Alwajih, Abdellah El Mekki, Samar Mohamed Magdy, Abdelrahim A. Elmadany, Omer Nacar, El Moatez Billah Nagoudi, Reem Abdel-Salam, Hanin Atwany, Youssef Nafea, Abdulfattah Mohammed Yahya, Rahaf Alhamouri, Hamzah A. Alsayadi, Hiba Zayed, Sara Shatnawi, Serry Sibaee, Yasir Ech-Chammakhy, Walid Al-Dhabyani, Marwa Mohamed Ali, Imen Jarraya, Ahmed Oumar El-Shangiti, Aisha Alraeesi, Mohammed Anwar Al-Ghrawi, Abdulrahman S. Al-Batati, Elgizouli Mohamed, Noha Taha Elgindi, Muhammed Saeed, Houdaifa Atou, Issam Ait Yahia, Abdelhak Bouayad, Mohammed Machrouh, Amal Makouar, Dania Alkawi, Mukhtar Mohamed, Safaa Taher Abdelfadil, Amine Ziad Ounnoughene, Rouabhia Anfel, Rwaa Assi, Ahmed Sorkatti, Mohamedou Cheikh Tourad, Anis Koubaa, Ismail Berrada, Mustafa Jarrar, Shady Shehata, Muhammad Abdul-Mageed
Title: Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs
Abstract:
As large language models (LLMs) become increasingly integrated into daily life, ensuring their cultural sensitivity and inclusivity is paramount. We introduce Palm, a dataset built through a year-long community-driven project covering all 22 Arab countries. The dataset includes instructions (input, response pairs) in both Modern Standard Arabic (MSA) and dialectal Arabic (DA), spanning 20 diverse topics. Built by a team of 44 researchers across the Arab world, all of whom are authors of this paper, our dataset offers a broad, inclusive perspective. We use our dataset to evaluate the cultural and dialectal capabilities of several frontier LLMs, revealing notable limitations. For instance, while closed-source LLMs generally exhibit strong performance, they are not without flaws, and smaller open-source models face greater challenges. Moreover, certain countries (e.g., Egypt, the UAE) appear better represented than others (e.g., Iraq, Mauritania, Yemen). Our annotation guidelines, code, and data for reproducibility are publicly available.
中文摘要:本研究引入了一个涵盖所有22个阿拉伯国家的社区驱动数据集,用于评估大型语言模型的文化和方言能力,揭示了它们在性能和代表性方面存在显著不足。
English Summary: This study introduces a comprehensive, community-driven dataset covering all 22 Arab countries to evaluate the cultural and dialectal capabilities of large language models, revealing significant limitations in their performance and representation.

Authors:Magnus Sesodia, Alina Petrova, John Armour, Thomas Lukasiewicz, Oana-Maria Camburu, Puneet K. Dokania, Philip Torr, Christian Schroeder de Witt
Title: AnnoCaseLaw: A Richly-Annotated Dataset For Benchmarking Explainable Legal Judgment Prediction
Abstract:
Legal systems worldwide continue to struggle with overwhelming caseloads, limited judicial resources, and growing complexities in legal proceedings. Artificial intelligence (AI) offers a promising solution, with Legal Judgment Prediction (LJP) -- the practice of predicting a court's decision from the case facts -- emerging as a key research area. However, existing datasets often formulate the task of LJP unrealistically, not reflecting its true difficulty. They also lack high-quality annotation essential for legal reasoning and explainability. To address these shortcomings, we introduce AnnoCaseLaw, a first-of-its-kind dataset of 471 meticulously annotated U.S. Appeals Court negligence cases. Each case is enriched with comprehensive, expert-labeled annotations that highlight key components of judicial decision making, along with relevant legal concepts. Our dataset lays the groundwork for more human-aligned, explainable LJP models. We define three legally relevant tasks: (1) judgment prediction; (2) concept identification; and (3) automated case annotation, and establish a performance baseline using industry-leading large language models (LLMs). Our results demonstrate that LJP remains a formidable task, with application of legal precedent proving particularly difficult. Code and data are available at https://github.com/anonymouspolar1/annocaselaw.
中文:AnnoCaseLaw推出了首个包含471个美国上诉法院过失案件精细标注的数据集,旨在解决现有法律判决预测数据集的不足,通过定义三项法律任务并利用大语言模型建立性能基准,推动更贴近人类思维、可解释的人工智能模型发展。
English: AnnoCaseLaw introduces a novel dataset of 471 annotated U.S. Appeals Court negligence cases to address the limitations of existing Legal Judgment Prediction datasets, enabling more realistic and explainable AI models by defining three legal tasks and establishing performance baselines with large language models.

Authors:Chong Zhang, Yukun Ma, Qian Chen, Wen Wang, Shengkui Zhao, Zexu Pan, Hao Wang, Chongjia Ni, Trung Hieu Nguyen, Kun Zhou, Yidi Jiang, Chaohong Tan, Zhifu Gao, Zhihao Du, Bin Ma
Title: InspireMusic: Integrating Super Resolution and Large Language Model for High-Fidelity Long-Form Music Generation
Abstract:
We introduce InspireMusic, a framework that integrates super resolution and a large language model for high-fidelity long-form music generation. The unified framework generates high-fidelity music, songs, and audio by incorporating an autoregressive transformer with a super-resolution flow-matching model. This framework enables the controllable generation of high-fidelity long-form music at a higher sampling rate from both text and audio prompts. Our model differs from previous approaches, as we utilize an audio tokenizer with one codebook that contains richer semantic information, thereby reducing training costs and enhancing efficiency. This combination enables us to achieve high-quality audio generation with long-form coherence of up to $8$ minutes. Then, an autoregressive transformer model based on Qwen 2.5 predicts audio tokens. Next, we employ a super-resolution flow-matching model to generate high-sampling rate audio with fine-grained details learned from an acoustic codec model. Comprehensive experiments show that the InspireMusic-1.5B-Long model has a comparable performance to recent top-tier open-source systems, including MusicGen and Stable Audio 2.0, on subjective and objective evaluations. The code and pre-trained models are released at https://github.com/FunAudioLLM/InspireMusic.
中文:InspireMusic是一个结合超分辨率与大语言模型的统一框架,能够根据文本或音频提示生成高保真度的长篇幅音乐,通过优化音频分词器降低训练成本,在主观和客观评估中达到与顶尖开源系统相当的性能水平。
English: InspireMusic is a unified framework that integrates super-resolution and large language models to generate high-fidelity, long-form music from text or audio prompts, achieving up to 8 minutes of coherent audio with reduced training costs and competitive performance against leading systems.

Authors:Nilay Yilmaz, Maitreya Patel, Yiran Lawrence Luo, Tejas Gokhale, Chitta Baral, Suren Jayasuriya, Yezhou Yang
Title: VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
Abstract:
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information. Despite their exceptional performance on visual understanding benchmarks, measuring their ability to reason abstractly across multiple images remains a significant challenge. To address this, we introduce VOILA, a large-scale, open-ended, dynamic benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning. VOILA employs an analogical mapping approach in the visual domain, requiring models to generate an image that completes an analogy between two given image pairs, reference and application, without relying on predefined choices. Our experiments demonstrate that the analogical reasoning tasks in VOILA present a challenge to MLLMs. Through multi-step analysis, we reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning. Notably, we observe that performance improves when following a multi-step strategy of least-to-most prompting. Comprehensive evaluations on open-source models and GPT-4o show that on text-based answers, the best accuracy for challenging scenarios is 13% (LLaMa 3.2) and even for simpler tasks is only 29% (GPT-4o), while human performance is significantly higher at 70% across both difficulty levels.
中文: VOILA作为新型评估基准,通过要求多模态大语言模型生成完成视觉类比的图像来测试其抽象关系推理能力,结果显示模型在跨图像理解方面存在显著困难,准确率远低于人类水平。
English: VOILA is a novel benchmark that evaluates multimodal large language models' abstract relational reasoning by requiring them to generate images completing visual analogies, revealing their significant struggles in inter-image understanding with accuracy far below human performance.

Authors:Shinwoo Park, Shubin Kim, Do-Kyung Kim, Yo-Sub Han
Title: KatFishNet: Detecting LLM-Generated Korean Text through Linguistic Feature Analysis
Abstract:
The rapid advancement of large language models (LLMs) increases the difficulty of distinguishing between human-written and LLM-generated text. Detecting LLM-generated text is crucial for upholding academic integrity, preventing plagiarism, protecting copyrights, and ensuring ethical research practices. Most prior studies on detecting LLM-generated text focus primarily on English text. However, languages with distinct morphological and syntactic characteristics require specialized detection approaches. Their unique structures and usage patterns can hinder the direct application of methods primarily designed for English. Among such languages, we focus on Korean, which has relatively flexible spacing rules, a rich morphological system, and less frequent comma usage compared to English. We introduce KatFish, the first benchmark dataset for detecting LLM-generated Korean text. The dataset consists of text written by humans and generated by four LLMs across three genres. By examining spacing patterns, part-of-speech diversity, and comma usage, we illuminate the linguistic differences between human-written and LLM-generated Korean text. Building on these observations, we propose KatFishNet, a detection method specifically designed for the Korean language. KatFishNet achieves an average of 19.78% higher AUROC compared to the best-performing existing detection method. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_generated_korean_text_through_linguistic_analysis.
中文: 本研究推出了首个检测韩语大语言模型生成文本的基准数据集KatFish,并提出专门方法KatFishNet,通过分析间距和词性多样性等语言特征,其AUROC指标比现有最佳方法平均提高19.78%。
English: This study introduces KatFish, the first benchmark dataset for detecting LLM-generated Korean text, and proposes KatFishNet, a specialized detection method that achieves 19.78% higher AUROC than existing approaches by analyzing linguistic features like spacing and part-of-speech diversity.
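The surface statistics the paper examines are easy to compute; the sketch below approximates token diversity with whitespace tokens rather than the proper part-of-speech tagging KatFishNet relies on:

```python
def korean_style_features(text):
    tokens = text.split()
    return {
        # Korean spacing rules are flexible, so spacing density is informative.
        "space_ratio": text.count(" ") / max(1, len(text)),
        # Commas are rarer in Korean than in English.
        "commas_per_100_chars": 100 * text.count(",") / max(1, len(text)),
        # Crude diversity proxy (the paper uses part-of-speech diversity).
        "type_token_ratio": len(set(tokens)) / max(1, len(tokens)),
    }

print(korean_style_features("나는 오늘, 학교에 갔다. 나는 기뻤다."))
```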

Authors:Komal Kumar, Tajamul Ashraf, Omkar Thawakar, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Phillip H. S. Torr, Fahad Shahbaz Khan, Salman Khan
Title: LLM Post-Training: A Deep Dive into Reasoning Large Language Models
Abstract:
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Pretraining on vast web-scale data has laid the foundation for these models, yet the research community is now increasingly shifting focus toward post-training techniques to achieve further breakthroughs. While pretraining provides a broad linguistic foundation, post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations. Fine-tuning, reinforcement learning, and test-time scaling have emerged as critical strategies for optimizing LLMs performance, ensuring robustness, and improving adaptability across various real-world tasks. This survey provides a systematic exploration of post-training methodologies, analyzing their role in refining LLMs beyond pretraining, addressing key challenges such as catastrophic forgetting, reward hacking, and inference-time trade-offs. We highlight emerging directions in model alignment, scalable adaptation, and inference-time reasoning, and outline future research directions. We also provide a public repository to continually track developments in this fast-evolving field: https://github.com/mbzuai-oryx/Awesome-LLM-Post-training.
中文摘要:本综述系统探讨了大型语言模型的后训练方法,这些方法在预训练基础上通过增强推理能力、事实准确性和伦理对齐来优化模型性能,同时解决了灾难性遗忘等关键挑战并展望了未来研究方向。
English Summary: This survey systematically examines post-training methods that refine Large Language Models beyond pretraining by enhancing reasoning, factual accuracy, and ethical alignment, while addressing challenges like catastrophic forgetting and outlining future research directions.

Authors:Dingyi Zhang, Deyu Zhou
Title: Persuasion Should be Double-Blind: A Multi-Domain Dialogue Dataset With Faithfulness Based on Causal Theory of Mind
Abstract:
Persuasive dialogue plays a pivotal role in human communication, influencing various domains. Recent persuasive dialogue datasets often fail to align with real-world interpersonal interactions, leading to unfaithful representations. For instance, unrealistic scenarios may arise, such as when the persuadee explicitly instructs the persuader on which persuasion strategies to employ, with each of the persuadee's questions corresponding to a specific strategy for the persuader to follow. This issue can be attributed to a violation of the "Double Blind" condition, where critical information is fully shared between participants. In actual human interactions, however, key information such as the mental state of the persuadee and the persuasion strategies of the persuader is not directly accessible. The persuader must infer the persuadee's mental state using Theory of Mind capabilities and construct arguments that align with the persuadee's motivations. To address this gap, we introduce ToMMA, a novel multi-agent framework for dialogue generation that is guided by causal Theory of Mind. This framework ensures that information remains undisclosed between agents, preserving "double-blind" conditions, while causal ToM directs the persuader's reasoning, enhancing alignment with human-like persuasion dynamics. Consequently, we present CToMPersu, a multi-domain, multi-turn persuasive dialogue dataset that tackles both double-blind and logical coherence issues, demonstrating superior performance across multiple metrics and achieving better alignment with real human dialogues. Our dataset and prompts are available at https://github.com/DingyiZhang/ToMMA-CToMPersu .
Chinese Summary: ToMMA框架采用因果心智理论构建多智能体说服对话系统,通过保持双盲条件增强真实性,其CToMPersu数据集在模拟人类对话方面优于现有基准。
English Summary: The ToMMA framework introduces a multi-agent persuasive dialogue system using causal Theory of Mind to maintain double-blind conditions and improve realism, accompanied by the CToMPersu dataset that outperforms existing benchmarks in mimicking human interactions.

Authors:Zhenyi Shen, Hanqi Yan, Linhai Zhang, Zhanghao Hu, Yali Du, Yulan He
Title: CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation
Abstract:
Chain-of-Thought (CoT) reasoning enhances Large Language Models (LLMs) by encouraging step-by-step reasoning in natural language. However, leveraging a latent continuous space for reasoning may offer benefits in terms of both efficiency and robustness. Prior implicit CoT methods attempt to bypass language completely by reasoning in continuous space but have consistently underperformed compared to the standard explicit CoT approach. We introduce CODI (Continuous Chain-of-Thought via Self-Distillation), a novel training framework that effectively compresses natural language CoT into continuous space. CODI jointly trains a teacher task (Explicit CoT) and a student task (Implicit CoT), distilling the reasoning ability from language into continuous space by aligning the hidden states of a designated token. Our experiments show that CODI is the first implicit CoT approach to match the performance of explicit CoT on GSM8k at the GPT-2 scale, achieving a 3.1x compression rate and outperforming the previous state-of-the-art by 28.2% in accuracy. CODI also demonstrates robustness, generalizable to complex datasets, and interpretability. These results validate that LLMs can reason effectively not only in natural language, but also in a latent continuous space. Code is available at https://github.com/zhenyi4/codi.
中文摘要:CODI提出了一种新颖的训练框架,能够将自然语言思维链推理有效压缩至连续空间,在保持与显式方法相当性能的同时,展现出更优的效率、鲁棒性和可解释性。
English Summary: CODI introduces a novel training framework that effectively compresses natural language chain-of-thought reasoning into continuous space, achieving comparable performance to explicit methods while demonstrating superior efficiency, robustness, and interpretability.
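The distillation signal described, aligning the hidden states of a designated token between the explicit-CoT teacher task and the implicit-CoT student task, can be sketched as a simple distance loss; the L1 choice and shapes here are assumptions, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def alignment_loss(teacher_hidden, student_hidden):
    # teacher_hidden / student_hidden: (num_layers, d_model) states of the
    # designated token under the explicit and implicit CoT tasks.
    return F.l1_loss(student_hidden, teacher_hidden.detach())

teacher = torch.randn(12, 768)                      # hypothetical sizes
student = torch.randn(12, 768, requires_grad=True)
loss = alignment_loss(teacher, student)
loss.backward()                                     # gradients reach the student only
print(float(loss))
```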

Authors:Fangxu Yu, Lai Jiang, Shenyi Huang, Zhen Wu, Xinyu Dai
Title: PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
Abstract:
The ability to understand and predict the mental states of oneself and others, known as the Theory of Mind (ToM), is crucial for navigating social scenarios effectively. Although recent studies have evaluated ToM in Large Language Models (LLMs), existing benchmarks focus on simplified settings (e.g., Sally-Anne-style tasks) and overlook the complexity of real-world social interactions. To mitigate this gap, we propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues. Our framework contains two core tasks: ToM Reasoning, which tests tracking of evolving desires, beliefs, and intentions; and ToM Application, which assesses the use of inferred mental states to predict and evaluate persuasion strategies. Experiments across eight leading LLMs reveal that while models excel on multiple questions, they struggle with tasks that require tracking the dynamics and shifts of mental states and comprehensively understanding mental states across the whole dialogue. Our aim with PersuasiveToM is to allow an effective evaluation of the ToM reasoning ability of LLMs with more focus on complex psychological activities. Our code is available at https://github.com/Yu-Fangxu/PersuasiveToM.
Chinese: PersuasiveToM基准测试通过说服性对话评估大语言模型的心理理论能力,发现尽管模型在简单任务上表现出色,但在追踪动态心理状态方面仍存在明显不足。
English: The PersuasiveToM benchmark evaluates large language models' Theory of Mind abilities in persuasive dialogues, revealing their limitations in tracking dynamic mental states despite strong performance on simpler tasks.

Authors:Xiusheng Huang, Jiaxiang Liu, Yequan Wang, Jun Zhao, Kang Liu
Title: Capability Localization: Capabilities Can be Localized rather than Individual Knowledge
Abstract:
Large-scale language models have achieved superior performance on natural language processing tasks; however, it is still unclear how model parameters affect performance improvement. Previous studies assumed that individual knowledge is stored in local parameters, variously characterized as dispersed parameters, parameter layers, or parameter chains, without a unified account. Through fidelity and reliability evaluation experiments, we found that individual knowledge cannot be localized. Afterwards, we constructed a dataset for decoupling experiments and discovered the potential for localizing data commonalities. To further reveal this phenomenon, this paper proposes a Commonality Neuron Localization (CNL) method, which successfully locates commonality neurons and achieves a neuron overlap rate of 96.42% on the GSM8K dataset. Finally, we demonstrate through cross-data experiments that commonality neurons are a collection of capability neurons that enhance performance. Our code is available at https://github.com/nlpkeg/Capability-Neuron-Localization.
中文: 本研究挑战了关于大语言模型中个体知识存储于局部参数的假设,提出共性神经元定位方法,在GSM8K数据集上成功定位共性神经元并达到96.42%的重叠率,验证了这些神经元作为能力神经元集合对性能提升的作用。
English: This study challenges the assumption that individual knowledge is stored in localized parameters of large language models, proposing a Commonality Neuron Localization method that successfully identifies shared capability neurons with a 96.42% overlap rate on GSM8K, demonstrating their role in performance enhancement.
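Code sketch: one way to make the neuron-overlap idea concrete (illustrative only; the top-k scoring rule, the k value, and the random stand-in activations are assumptions, not the paper's procedure). For each sample, rank MLP neurons by mean absolute activation, keep the top k, and intersect the sets across samples.

    import numpy as np

    def top_k_neurons(acts, k):
        """acts: (seq_len, n_neurons) activations for one sample -> top-k neuron ids."""
        return set(np.argsort(np.abs(acts).mean(axis=0))[-k:])

    def commonality_overlap(per_sample_acts, k=100):
        sets = [top_k_neurons(a, k) for a in per_sample_acts]
        common = set.intersection(*sets)
        return common, len(common) / k   # shared neurons and overlap rate

    rng = np.random.default_rng(0)
    acts = [rng.normal(size=(32, 3072)) for _ in range(5)]   # stand-in activations
    common, rate = commonality_overlap(acts)
    print(f"{len(common)} shared neurons, overlap rate {rate:.2%}")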

Authors:Thanet Markchom, Tong Wu, Liting Huang, Huizhi Liang
Title: UoR-NCL at SemEval-2025 Task 1: Using Generative LLMs and CLIP Models for Multilingual Multimodal Idiomaticity Representation
Abstract:
SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less effective than using embeddings without fine-tuning. The source code used in this paper is available at https://github.com/tongwu17/SemEval-2025-Task1-UoR-NCL.
中文: 本研究通过使用大语言模型生成习语含义和多语言CLIP模型进行编码,提升了图像与习语复合词的匹配排序效果,实验表明多模态表征优于仅基于原始复合词的方法。
English: This study enhances image ranking for idiomatic compounds by using LLMs to generate meanings and multilingual CLIP models for encoding, with results showing improved performance through multimodal representations over original compounds.
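Code sketch: the ranking step under stated assumptions — openai/clip-vit-base-patch32 stands in for the multilingual CLIP models actually used, and the idiomatic gloss is assumed to come from an upstream LLM call.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    def rank_images(gloss, image_paths):
        """Rank candidate images by cosine similarity to the LLM-generated gloss."""
        images = [Image.open(p).convert("RGB") for p in image_paths]
        inputs = processor(text=[gloss], images=images, return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        v = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        sims = (v @ t.T).squeeze(-1)
        order = sims.argsort(descending=True)
        return [(image_paths[i], sims[i].item()) for i in order]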

Authors:Jonathan Drechsel, Anja Reusch, Steffen Herbold
Title: MAMUT: A Novel Framework for Modifying Mathematical Formulas for the Generation of Specialized Datasets for Language Model Training
Abstract:
Mathematical formulas are a fundamental and widely used component in various scientific fields, serving as a universal language for expressing complex concepts and relationships. While state-of-the-art transformer models excel in processing and understanding natural language, they encounter challenges with mathematical notation, which involves a complex structure and diverse representations. This study focuses on the development of specialized training datasets to enhance the encoding of mathematical content. We introduce Math Mutator (MAMUT), a framework capable of generating equivalent and falsified versions of a given mathematical formula in LaTeX notation, effectively capturing the variety of mathematical notation for the same concept. Based on MAMUT, we have generated four large mathematical datasets containing diverse notation. Experiments show that models trained on these datasets exhibit new SoTA performance on mathematical retrieval tasks. We publish our code, generated datasets, and pretrained mathematical models: https://github.com/aieng-lab/math-mutator.
中文摘要:本研究提出Math Mutator (MAMUT)框架,通过生成多样化数学公式变体构建专业训练数据集,在数学检索任务中实现了最先进的性能表现。
English Summary: This study introduces Math Mutator (MAMUT), a framework that generates diverse mathematical formula variations to create specialized training datasets, achieving state-of-the-art performance in mathematical retrieval tasks.
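Code sketch: a toy version of the equivalent/falsified split using sympy; MAMUT's actual mutation rules operate on LaTeX and are far richer, so the two helpers below are illustrative stand-ins.

    import random
    import sympy as sp

    x, y = sp.symbols("x y")

    def equivalent_variant(expr):
        """Rewrite the formula without changing its meaning."""
        return random.choice([sp.expand(expr), sp.factor(expr), sp.simplify(expr)])

    def falsified_variant(expr):
        """Introduce a meaning-changing edit: flip the sign of one term."""
        terms = list(sp.Add.make_args(sp.expand(expr)))
        i = random.randrange(len(terms))
        terms[i] = -terms[i]
        return sp.Add(*terms)

    formula = (x + y) ** 2
    print(sp.latex(equivalent_variant(formula)))   # e.g. x^{2} + 2 x y + y^{2}
    print(sp.latex(falsified_variant(formula)))    # e.g. x^{2} - 2 x y + y^{2}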

Authors:Kuang-Da Wang, Teng-Ruei Chen, Yu Heng Hung, Guo-Xun Ko, Shuoyang Ding, Yueh-Hua Wu, Yu-Chiang Frank Wang, Chao-Han Huck Yang, Wen-Chih Peng, Ping-Chun Hsieh
Title: Plan2Align: Predictive Planning Based Test-Time Preference Alignment for Large Language Models
Abstract:
Aligning Large Language Models with Preference Fine-Tuning is often resource-intensive. Test-time alignment techniques that do not modify the underlying models, such as prompting and guided decoding, offer a lightweight alternative. However, existing test-time alignment methods primarily improve short responses and fail to ensure coherence over extended contexts due to the myopic nature of token-level alignment. Moreover, these methods often incur a slowdown during inference. To address these challenges, we propose Plan2Align, a test-time alignment framework that formulates text generation as a predictive planning problem. Plan2Align adapts Model Predictive Control (MPC) to iteratively refine output by rolling out multiple complete responses and optimizing each segment. To evaluate effectiveness and efficiency more rigorously, we focus on the more challenging task of long-text generation. Experiments on the long-form response subset of the HH-RLHF dataset and the WMT'24 Discourse-Level Literary Translation demonstrate that Plan2Align significantly enhances the performance of base LLMs. Compared to existing training-time and test-time alignment methods on LLaMA-3.1 8B, Plan2Align achieves comparable or superior results, while also delivering improved inference efficiency relative to prior test-time alignment approaches.
中文摘要:Plan2Align是一种测试时对齐框架,将文本生成视为预测性规划问题,通过模型预测控制迭代优化输出,在长文本生成任务中相比现有方法展现出更优的性能和效率。
English Summary: Plan2Align is a test-time alignment framework that treats text generation as a predictive planning task, using Model Predictive Control to iteratively refine outputs and demonstrating superior performance and efficiency in long-text generation tasks compared to existing methods.
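Code sketch: the MPC loop reduced to its skeleton, with generate (assumed to return the full text including the prefix) and reward (assumed to score a complete response) as placeholder callables; segment length and rollout count are arbitrary.

    def mpc_decode(prompt, generate, reward, n_rollouts=4, segment_len=64, max_new=512):
        committed = prompt
        while len(committed) - len(prompt) < max_new:
            rollouts = [generate(committed) for _ in range(n_rollouts)]  # full responses
            best = max(rollouts, key=reward)                             # pick the best plan
            segment = best[len(committed):len(committed) + segment_len]
            if not segment:          # generation ended; nothing left to commit
                break
            committed += segment     # commit one segment, then re-plan
        return committed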

Authors:Qiao Yan, Yuchen Yuan, Xiaowei Hu, Yihan Wang, Jiaqi Xu, Jinpeng Li, Chi-Wing Fu, Pheng-Ann Heng
Title: MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models
Abstract:
The increasing use of vision-language models (VLMs) in healthcare applications presents great challenges related to hallucinations, in which the models may generate seemingly plausible results that are in fact incorrect. Such hallucinations can jeopardize clinical decision making, potentially harming diagnosis and treatment. In this work, we propose MedHallTune, a large-scale benchmark designed specifically to evaluate and mitigate hallucinations in medical VLMs. Comprising over 100,000 images and 1,000,000 instruction pairs, MedHallTune includes both hallucination and non-hallucination samples, each with ground-truth annotations. We conduct a comprehensive evaluation of current medical and general VLMs using MedHallTune, assessing their performance across key metrics, including clinical accuracy, relevance, detail level, and risk level. The experimental results show that fine-tuning with MedHallTune successfully improves the ability of several existing models to manage hallucinations and boosts their zero-shot performance on downstream visual-question-answering (VQA) tasks, making them more reliable for practical medical applications. Our work contributes to the development of more trustworthy VLMs. Codes and dataset will be available at https://github.com/russellyq/MedHallTune.
中文: 本研究提出了MedHallTune这一大规模基准,用于评估和减少医学视觉语言模型的幻觉问题,实验表明基于该数据的微调能有效提升模型在医疗应用中的可靠性。
English: This study introduces MedHallTune, a large-scale benchmark to evaluate and reduce hallucinations in medical vision-language models, demonstrating that fine-tuning with it enhances model reliability for healthcare applications.

Authors:Yingqi Gao, Zhiling Luo
Title: Automatic database description generation for Text-to-SQL
Abstract:
In the context of the Text-to-SQL task, table and column descriptions are crucial for bridging the gap between natural language and database schema. This report proposes a method for automatically generating effective database descriptions when explicit descriptions are unavailable. The proposed method employs a dual-process approach: a coarse-to-fine process, followed by a fine-to-coarse process. The coarse-to-fine approach leverages the inherent knowledge of the LLM to guide the understanding process from databases to tables and finally to columns. This approach provides a holistic understanding of the database structure and ensures contextual alignment. Conversely, the fine-to-coarse approach starts at the column level, offering a more accurate and nuanced understanding when stepping back to the table level. Experimental results on the Bird benchmark indicate that using descriptions generated by the proposed method improves SQL generation accuracy by 0.93% compared to not using descriptions, and achieves 37% of human-level performance. The source code is publicly available at https://github.com/XGenerationLab/XiYan-DBDescGen.
Chinese: 该报告提出了一种双过程方法,在Text-to-SQL任务中自动生成数据库描述,通过从粗到细和从细到粗相结合的方式,在Bird基准测试中将SQL生成准确率提升了0.93%。
English: This report introduces a dual-process method for automatically generating database descriptions in Text-to-SQL tasks, combining coarse-to-fine and fine-to-coarse approaches to improve SQL generation accuracy by 0.93% on the Bird benchmark.

Authors:Haitao Li, Yifan Chen, Yiran Hu, Qingyao Ai, Junjie Chen, Xiaoyu Yang, Jianhui Yang, Yueyue Wu, Zeyang Liu, Yiqun Liu
Title: LexRAG: Benchmarking Retrieval-Augmented Generation in Multi-Turn Legal Consultation Conversation
Abstract:
Retrieval-augmented generation (RAG) has proven highly effective in improving large language models (LLMs) across various domains. However, there is no benchmark specifically designed to assess the effectiveness of RAG in the legal domain, which restricts progress in this area. To fill this gap, we propose LexRAG, the first benchmark to evaluate RAG systems for multi-turn legal consultations. LexRAG consists of 1,013 multi-turn dialogue samples and 17,228 candidate legal articles. Each sample is annotated by legal experts and consists of five rounds of progressive questioning. LexRAG includes two key tasks: (1) Conversational knowledge retrieval, requiring accurate retrieval of relevant legal articles based on multi-turn context. (2) Response generation, focusing on producing legally sound answers. To ensure reliable reproducibility, we develop LexiT, a legal RAG toolkit that provides a comprehensive implementation of RAG system components tailored for the legal domain. Additionally, we introduce an LLM-as-a-judge evaluation pipeline to enable detailed and effective assessment. Through experimental analysis of various LLMs and retrieval methods, we reveal the key limitations of existing RAG systems in handling legal consultation conversations. LexRAG establishes a new benchmark for the practical application of RAG systems in the legal domain, with its code and data available at https://github.com/CSHaitao/LexRAG.
Chinese: LexRAG是首个针对多轮法律咨询的检索增强生成系统评估基准,通过标注对话和法律条文填补了法律领域专业评估的空白。
English: LexRAG is the first benchmark designed to evaluate retrieval-augmented generation systems for multi-turn legal consultations, featuring annotated dialogues and legal articles to address the lack of specialized assessments in the legal domain.

Authors:Kai Mei, Wujiang Xu, Shuhang Lin, Yongfeng Zhang
Title: OmniRouter: Budget and Performance Controllable Multi-LLM Routing
Abstract:
Large language models (LLMs) deliver superior performance but require substantial computational resources and operate with relatively low efficiency, while smaller models can efficiently handle simpler tasks with fewer resources. LLM routing is a crucial paradigm that dynamically selects the most suitable large language models from a pool of candidates to process diverse inputs, ensuring optimal resource utilization while maintaining response quality. Existing routing frameworks typically model this as a locally optimal decision-making problem, selecting the presumed best-fit LLM for each query individually, an approach that overlooks global budget constraints and results in ineffective resource allocation. To tackle this problem, we introduce OmniRouter, a fundamentally controllable routing framework for multi-LLM serving. Instead of making per-query greedy choices, OmniRouter models the routing task as a constrained optimization problem, assigning models that minimize total cost while ensuring the required performance level. Specifically, a hybrid retrieval-augmented predictor is designed to predict the capabilities and costs of LLMs, and a constrained optimizer is employed to achieve globally optimal query-model allocation. Experiments show that OmniRouter achieves up to 6.30% improvement in response accuracy while simultaneously reducing computational costs by at least 10.15% compared to competitive router baselines. The code and the dataset are available at https://github.com/agiresearch/OmniRouter.
中文摘要:OmniRouter是一种创新的LLM路由框架,通过将模型选择构建为约束优化问题,在降低计算成本的同时提升了响应准确性,相比现有方法表现更优。
English Summary: OmniRouter is a novel LLM routing framework that formulates model selection as a constrained optimization problem, achieving higher response accuracy while significantly reducing computational costs compared to existing methods.
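Code sketch: routing as a small integer program, an assumed formalization consistent with the abstract (solved here with the pulp package): minimize total predicted cost while keeping average predicted accuracy above a floor.

    import pulp

    def route(pred_acc, pred_cost, min_avg_acc):
        """pred_acc/pred_cost: dicts keyed by (query, model); returns query -> model."""
        queries = sorted({q for q, _ in pred_acc})
        models = sorted({m for _, m in pred_acc})
        prob = pulp.LpProblem("routing", pulp.LpMinimize)
        x = pulp.LpVariable.dicts("x", (queries, models), cat="Binary")
        prob += pulp.lpSum(pred_cost[q, m] * x[q][m] for q in queries for m in models)
        for q in queries:   # each query is served by exactly one model
            prob += pulp.lpSum(x[q][m] for m in models) == 1
        total_acc = pulp.lpSum(pred_acc[q, m] * x[q][m] for q in queries for m in models)
        prob += total_acc >= min_avg_acc * len(queries)   # global performance floor
        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return {q: next(m for m in models if x[q][m].value() == 1) for q in queries}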

Authors:Sari Masri, Huthaifa I. Ashqar, Mohammed Elhenawy
Title: Visual Reasoning at Urban Intersections: FineTuning GPT-4o for Traffic Conflict Detection
Abstract:
Traffic control at unsignalized urban intersections presents significant challenges due to complexity, frequent conflicts, and blind spots. This study explores the capability of Multimodal Large Language Models (MLLMs), such as GPT-4o, to provide logical and visual reasoning directly from bird's-eye-view videos of four-legged intersections. In the proposed method, GPT-4o acts as an intelligent system to detect conflicts and provide explanations and recommendations for drivers. The fine-tuned model achieved an accuracy of 77.14%, while manual evaluation of the fine-tuned GPT-4o's correct predictions showed 89.9% accuracy for model-generated explanations and 92.3% for the recommended next actions. These results highlight the feasibility of using MLLMs for real-time traffic management with videos as inputs, offering scalable and actionable insights into intersection traffic management and operation. Code used in this study is available at https://github.com/sarimasri3/Traffic-Intersection-Conflict-Detection-using-images.git.
中文: 本研究证明,多模态大语言模型如GPT-4o能通过分析鸟瞰视频有效管理无信号灯交叉口,实现冲突检测和驾驶建议,在解释说明和可操作建议方面均展现出高准确率。
English: This study demonstrates that Multimodal Large Language Models like GPT-4o can effectively manage unsignalized intersections by analyzing bird's-eye-view videos to detect conflicts and provide driving recommendations, achieving high accuracy in explanations and actionable insights.

Authors:Jin Peng Zhou, Kaiwen Wang, Jonathan Chang, Zhaolin Gao, Nathan Kallus, Kilian Q. Weinberger, Kianté Brantley, Wen Sun
Title: $Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training
Abstract:
Reinforcement learning (RL) post-training is crucial for LLM alignment and reasoning, but existing policy-based methods, such as PPO and DPO, can fall short of fixing shortcuts inherited from pre-training. In this work, we introduce $Q\sharp$, a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function. We propose to learn the optimal $Q$ function using distributional RL on an aggregated online dataset. Unlike prior value-based baselines that guide the model using unregularized $Q$-values, our method is theoretically principled and provably learns the optimal policy for the KL-regularized RL problem. Empirically, $Q\sharp$ outperforms prior baselines in math reasoning benchmarks while maintaining a smaller KL divergence to the reference policy. Theoretically, we establish a reduction from KL-regularized RL to no-regret online learning, providing the first bounds for deterministic MDPs under only realizability. Thanks to distributional RL, our bounds are also variance-dependent and converge faster when the reference policy has small variance. In sum, our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees. The code can be found at https://github.com/jinpz/q_sharp.
中文: 提出的$Q\sharp$算法采用基于价值的分布强化学习方法优化KL正则化强化学习,在数学推理基准上优于现有方法,同时保持理论保证和更小的KL散度。
English: The proposed $Q\sharp$ algorithm introduces a value-based approach using distributional reinforcement learning to optimize KL-regularized RL, outperforming existing methods in math reasoning while maintaining theoretical guarantees and smaller KL divergence.
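Code sketch: the closed form that makes value-based guidance possible — the KL-regularized optimal policy reweights the reference policy by the exponentiated regularized Q. The Q estimates are taken as given here; learning them with distributional RL is the paper's contribution.

    import numpy as np

    def kl_regularized_policy(ref_logprobs, q_values, eta=1.0):
        """pi*(a|s) proportional to pi_ref(a|s) * exp(Q*(s,a) / eta)."""
        logits = ref_logprobs + q_values / eta
        logits -= logits.max()                # numerical stability
        p = np.exp(logits)
        return p / p.sum()

    ref = np.log(np.array([0.6, 0.3, 0.1]))   # reference next-token distribution
    q = np.array([0.0, 1.0, 0.2])             # stand-in regularized Q estimates
    print(kl_regularized_policy(ref, q, eta=0.5))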

Authors:Julius Broomfield, Kartik Sharma, Srijan Kumar
Title: A Thousand Words or An Image: Studying the Influence of Persona Modality in Multimodal LLMs
Abstract:
Large language models (LLMs) have recently demonstrated remarkable advancements in embodying diverse personas, enhancing their effectiveness as conversational agents and virtual assistants. Consequently, LLMs have made significant strides in processing and integrating multimodal information. However, even though human personas can be expressed in both text and image, the extent to which the modality of a persona impacts the embodiment by the LLM remains largely unexplored. In this paper, we investigate how different modalities influence the expressiveness of personas in multimodal LLMs. To this end, we create a novel modality-parallel dataset of 40 diverse personas varying in age, gender, occupation, and location. This consists of four modalities to equivalently represent a persona: image-only, text-only, a combination of image and small text, and typographical images, where text is visually stylized to convey persona-related attributes. We then create a systematic evaluation framework with 60 questions and corresponding metrics to assess how well LLMs embody each persona across its attributes and scenarios. Comprehensive experiments on five multimodal LLMs show that personas represented by detailed text exhibit more linguistic habits, while typographical images often show more consistency with the persona. Our results reveal that LLMs often overlook persona-specific details conveyed through images, highlighting underlying limitations and paving the way for future research to bridge this gap. We release the data and code at https://github.com/claws-lab/persona-modality.
中文摘要:本研究探讨了不同模态如何影响多模态大语言模型的人格体现,发现文本描述的人格能增强语言习惯,而排版图像则提高一致性,但模型常忽略图像传达的人格细节。
English Summary: This study explores how different modalities affect persona embodiment in multimodal large language models, revealing that text-based personas enhance linguistic habits while typographical images improve consistency, yet models often miss image-conveyed persona details.

Authors:Jonathan Tonglet, Tinne Tuytelaars, Marie-Francine Moens, Iryna Gurevych
Title: Protecting multimodal large language models against misleading visualizations
Abstract:
Visualizations play a pivotal role in daily communication in an increasingly data-driven world. Research on multimodal large language models (MLLMs) for automated chart understanding has accelerated massively, with steady improvements on standard benchmarks. However, for MLLMs to be reliable, they must be robust to misleading visualizations, i.e., charts that distort the underlying data, leading readers to draw inaccurate conclusions that may support disinformation. Here, we uncover an important vulnerability: MLLM question-answering (QA) accuracy on misleading visualizations drops on average to the level of the random baseline. To address this, we introduce the first inference-time methods to improve QA performance on misleading visualizations, without compromising accuracy on non-misleading ones. We find that two methods, table-based QA and redrawing the visualization, are effective, with improvements of up to 19.6 percentage points. We make our code and data available.
中文摘要:多模态大语言模型在误导性图表面前存在显著漏洞,其问答准确率会降至随机基线水平,但采用基于表格的问答和图表重绘等新型推理时方法可提升多达19.6个百分点的性能,同时不影响正常图表的处理精度。
English Summary: Multimodal large language models exhibit significant vulnerability to misleading visualizations, dropping to random baseline accuracy, but new inference-time methods like table-based QA and chart redrawing can improve performance by up to 19.6 percentage points without affecting standard chart accuracy.

Authors:Tianyi Lorena Yan, Robin Jia
Title: Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
Abstract:
To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets, models, and prompt templates, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both Token Lens, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs' internal components interact with different input tokens to support complex factual recall. Code is available at https://github.com/Lorenayannnnn/how-lms-answer-one-to-many-factual-queries.
中文: 语言模型采用“先促进后抑制”机制,通过注意力与多层感知机先召回全部事实答案再抑制已生成内容,这一机制经Token Lens和敲除分析等实验方法得到验证。
English: Language models employ a promote-then-suppress mechanism, using attention and MLPs to first recall all factual answers and then suppress previously generated ones, as validated through experimental techniques like Token Lens and knockout analysis.
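Code sketch: a rough, logit-lens-style rendering of the Token Lens idea — decode the attention update attributable to chosen source tokens through the unembedding. All shapes and names below are illustrative assumptions, not the released implementation.

    import torch

    def token_lens(attn, values, W_O, W_U, src_positions, query_pos):
        """
        attn: (seq, seq) weights for one head; values: (seq, d_head);
        W_O: (d_head, d_model) head output projection; W_U: (d_model, vocab).
        Returns vocabulary logits decoded from the update contributed by src_positions.
        """
        contrib = torch.zeros(values.shape[-1])
        for s in src_positions:               # aggregate updates from source tokens only
            contrib = contrib + attn[query_pos, s] * values[s]
        return contrib @ W_O @ W_U            # map into d_model, then decode to vocab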

Authors:Yiheng Liu, Xiaohui Gao, Haiyang Sun, Bao Ge, Tianming Liu, Junwei Han, Xintao Hu
Title: Brain-Inspired Exploration of Functional Networks and Key Neurons in Large Language Models
Abstract:
In recent years, the rapid advancement of large language models (LLMs) in natural language processing has sparked significant interest among researchers in understanding their mechanisms and functional characteristics. Although existing studies have attempted to explain LLM functionalities by identifying and interpreting specific neurons, these efforts mostly focus on individual neuron contributions, neglecting the fact that human brain functions are realized through intricate interaction networks. Inspired by cognitive neuroscience research on functional brain networks (FBNs), this study introduces a novel approach to investigate whether similar functional networks exist within LLMs. We use methods similar to those in the field of functional neuroimaging analysis to locate and identify functional networks in LLMs. Experimental results show that, similar to the human brain, LLMs contain functional networks that frequently recur during operation. Further analysis shows that these functional networks are crucial for LLM performance. Masking key functional networks significantly impairs the model's performance, while retaining just a subset of these networks is adequate to maintain effective operation. This research provides novel insights into the interpretation of LLMs and the lightweighting of LLMs for certain downstream tasks. Code is available at https://github.com/WhatAboutMyStar/LLM_ACTIVATION.
Chinese: 本研究受脑功能网络启发,提出了一种识别大语言模型中重复出现的功能网络的新方法,揭示了这些网络对模型性能的关键作用及其在模型轻量化方面的潜力。
English: This study introduces a novel approach inspired by functional brain networks to identify recurring functional networks in large language models, demonstrating their critical role in model performance and potential for model lightweighting.

Authors:Arnav Kumar Jain, Gonzalo Gonzalez-Pumariega, Wayne Chen, Alexander M Rush, Wenting Zhao, Sanjiban Choudhury
Title: Multi-Turn Code Generation Through Single-Step Rewards
Abstract:
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $μ$Code, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $μ$Code iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code. Experimental evaluations show that our approach achieves significant improvements over the state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $μ$Code at utilizing the execution feedback. Our code is available at https://github.com/portal-cornell/muCode.
Chinese: 提出的$μ$Code方法通过使用单步奖励,训练生成器和验证器基于执行反馈迭代改进代码解决方案,为多轮代码生成提供了一种简单且可扩展的途径,相比现有方法取得了显著性能提升。
English: The proposed $μ$Code method introduces a simple and scalable approach to multi-turn code generation by using single-step rewards, training both a generator and a verifier to iteratively improve code solutions based on execution feedback, achieving significant performance gains over existing methods.
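Code sketch: the inference-time loop implied by the abstract, with generate, verify (the learned verifier's scalar score), and execute (runs tests, returns a success flag and a trace) as assumed callables.

    def multi_turn_codegen(task, generate, verify, execute, n_candidates=5, max_turns=3):
        feedback = ""
        best = None
        for _ in range(max_turns):
            candidates = [generate(task, feedback) for _ in range(n_candidates)]
            best = max(candidates, key=verify)   # single-step reward: verifier score
            ok, trace = execute(best)            # execution feedback for the next turn
            if ok:
                return best
            feedback += "\nPrevious attempt failed:\n" + trace
        return best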

Authors:Albert Gong, Kamilė Stankevičiūtė, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P. Gomes, Kilian Q. Weinberger
Title: PhantomWiki: On-Demand Datasets for Reasoning and Retrieval Evaluation
Abstract:
High-quality benchmarks are essential for evaluating reasoning and retrieval capabilities of large language models (LLMs). However, curating datasets for this purpose is not a permanent solution as they are prone to data leakage and inflated performance results. To address these challenges, we propose PhantomWiki: a pipeline to generate unique, factually consistent document corpora with diverse question-answer pairs. Unlike prior work, PhantomWiki is neither a fixed dataset, nor is it based on any existing data. Instead, a new PhantomWiki instance is generated on demand for each evaluation. We vary the question difficulty and corpus size to disentangle reasoning and retrieval capabilities respectively, and find that PhantomWiki datasets are surprisingly challenging for frontier LLMs. Thus, we contribute a scalable and data leakage-resistant framework for disentangled evaluation of reasoning, retrieval, and tool-use abilities. Our code is available at https://github.com/kilian-group/phantom-wiki.
中文: PhantomWiki是一种创新的流程,可按需生成独特的文档库和问答对,用于评估大型语言模型,有效解决数据泄露问题,并能够分别评估推理和检索能力。
English: PhantomWiki is a novel pipeline that generates unique, on-demand document corpora and question-answer pairs to evaluate large language models, effectively addressing data leakage and enabling disentangled assessment of reasoning and retrieval capabilities.

Authors:Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang
Title: Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think
Abstract:
The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control output images with additional conditions, like canny and depth map, a comprehensive framework for arbitrary text-image interleaved control is still lacking. This gap is especially evident when attempting to merge concepts or visual elements from multiple images in the generation process. To mitigate the gap, we conducted preliminary experiments showing that large multimodal models (LMMs) offer an effective shared representation space, where image and text can be well-aligned to serve as a condition for external diffusion models. Based on this discovery, we propose Dream Engine, an efficient and unified framework designed for arbitrary text-image interleaved control in image generation models. Building on powerful text-to-image models like SD3.5, we replace the original text-only encoders by incorporating versatile multimodal information encoders such as QwenVL. Our approach utilizes a two-stage training paradigm, consisting of joint text-image alignment and multimodal interleaved instruction tuning. Our experiments demonstrate that this training method is effective, achieving a 0.69 overall score on the GenEval benchmark, and matching the performance of state-of-the-art text-to-image models like SD3.5 and FLUX.
Chinese: Dream Engine框架通过将多模态编码器与扩散模型结合,采用两阶段训练方法,有效解决了图像生成中文本与图像交替控制的统一性问题,并在性能上达到先进水平。
English: The Dream Engine framework addresses the lack of unified text-image interleaved control in image generation by integrating multimodal encoders like QwenVL with diffusion models, achieving competitive performance through a two-stage training approach.

Authors:Yifan Zhang, Wenyu Du, Dongming Jin, Jie Fu, Zhi Jin
Title: Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking
Abstract:
Chain-of-thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. Our key contributions are: (1) We evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) Next, we identify the circuit (a subset of model components, responsible for tracking the world state), indicating that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three challenging settings: skipping intermediate steps, introducing data noises, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSAs), highlighting its resilience in challenging scenarios. Our code is available at https://github.com/IvanChangPKU/FSA.
中文: 思维链(CoT)通过激活后层MLP神经元形成隐式有限状态自动机,显著增强Transformer模型的性能,并在复杂场景中展现出强大鲁棒性。
English: Chain-of-thought (CoT) boosts Transformer model performance by enabling implicit finite state automata through late-layer MLP neurons, demonstrating robustness in challenging scenarios.

Authors:Tergel Munkhbat, Namgyu Ho, Seo Hyun Kim, Yongjin Yang, Yujin Kim, Se-Young Yun
Title: Self-Training Elicits Concise Reasoning in Large Language Models
Abstract:
Chain-of-thought (CoT) reasoning has enabled large language models (LLMs) to utilize additional computation through intermediate tokens to solve complex tasks. However, we posit that typical reasoning traces contain many redundant tokens, incurring extraneous inference costs. Upon examination of the output distribution of current LLMs, we find evidence on their latent ability to reason more concisely, relative to their default behavior. To elicit this capability, we propose simple fine-tuning methods which leverage self-generated concise reasoning paths obtained by best-of-N sampling and few-shot conditioning, in task-specific settings. Our combined method achieves a 30% reduction in output tokens on average, across five model families on GSM8K and MATH, while maintaining average accuracy. By exploiting the fundamental stochasticity and in-context learning capabilities of LLMs, our self-training approach robustly elicits concise reasoning on a wide range of models, including those with extensive post-training. Code is available at https://github.com/TergelMunkhbat/concise-reasoning
中文: 思维链推理常产生冗余标记,但通过基于自生成简洁路径的微调,模型能在保持准确性的同时平均减少30%的输出长度。
English: Chain-of-thought reasoning in LLMs often produces redundant tokens, but by fine-tuning with self-generated concise paths, models can reduce output length by 30% while maintaining accuracy.
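Code sketch: the data-construction step, under assumptions — sample is a stochastic generation call, extract_answer parses a final answer from a reasoning path, and character length stands in for token count.

    def build_concise_dataset(problems, sample, extract_answer, n=8):
        dataset = []
        for prob in problems:
            paths = [sample(prob["question"]) for _ in range(n)]     # best-of-N sampling
            correct = [p for p in paths if extract_answer(p) == prob["answer"]]
            if correct:
                shortest = min(correct, key=len)                     # keep the most concise
                dataset.append({"question": prob["question"], "target": shortest})
        return dataset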

Authors:Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang
Title: LongRoPE2: Near-Lossless LLM Context Window Scaling
Abstract:
LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
中文: LongRoPE2是一种新颖方法,通过进化搜索的RoPE缩放算法和混合上下文窗口训练,在保持原始短上下文性能的同时,将预训练大语言模型的有效上下文窗口扩展至目标长度。
English: LongRoPE2 is a novel method that extends LLMs' effective context window to target lengths while maintaining original short-context performance through addressing insufficient RoPE dimension training with evolutionary search-based rescaling and mixed context window fine-tuning.
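Code sketch: what per-dimension RoPE rescaling looks like mechanically; the linspace factors below are placeholders for the factors LongRoPE2 actually finds by evolutionary search.

    import numpy as np

    def rescaled_rope_freqs(d_head=64, base=10000.0, factors=None):
        half = d_head // 2
        inv_freq = base ** (-np.arange(half) / half)   # standard RoPE frequencies
        if factors is None:
            factors = np.linspace(1.0, 8.0, half)      # stand-in per-dimension factors
        return inv_freq / factors                      # slower rotation -> longer reach

    print(rescaled_rope_freqs()[:4])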

Authors:Haochen Sun, Shuwen Zhang, Lujie Niu, Lei Ren, Hao Xu, Hao Fu, Fangkun Zhao, Caixia Yuan, Xiaojie Wang
Title: Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents
Abstract:
Large Language Model (LLM)-based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-based Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks in two novel ways. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments with 13 popular LLMs and show that, while the LLMs exhibit a strong ability in goal interpretation, there are significant shortcomings in active collaboration and continuous adaptation, which are critical for efficiently fulfilling complex tasks. Notably, we highlight the strengths and weaknesses of LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-source benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.
中文: 本文提出基于Overcooked-AI开发的Collab-Overcooked多智能体基准测试,通过开放式协作任务和过程导向评估指标,揭示大语言模型虽在目标理解表现优异,但在主动协作和持续适应方面存在明显不足。
English: This paper introduces Collab-Overcooked, a novel LLM-based multi-agent benchmark built on Overcooked-AI that features enhanced collaborative tasks and process-oriented evaluation metrics, revealing LLMs' strengths in goal interpretation but deficiencies in active collaboration and adaptation.

Authors:Huazheng Wang, Yongcheng Jing, Haifeng Sun, Yingjie Wang, Jingyu Wang, Jianxin Liao, Dacheng Tao
Title: Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models
Abstract:
In this paper, we investigate knowledge forgetting in large language models with a focus on its generalisation--ensuring that models forget not only specific training samples but also related implicit knowledge. To this end, we begin by identifying a broader unlearning scope that includes both target data and logically associated samples, including rephrased, subject-replaced, one-hop reasoned, and relation-reversed data. To rigorously evaluate generalisation, we introduce UGBench, the first comprehensive benchmark specifically designed to assess the unlearning of in-scope implicit knowledge covering 13 state-of-the-art methods across three datasets. UGBench reveals that unlearned models can still recall paraphrased answers and retain target facts in intermediate layers. This motivates us to take a preliminary step toward more generalised implicit knowledge forgetting by proposing PerMU, a novel probability perturbation-based unlearning paradigm. PerMU simulates adversarial unlearning samples to eliminate fact-related tokens from the logit distribution, collectively reducing the probabilities of all answer-associated tokens. Experiments are conducted on a diverse range of datasets, including TOFU, Harry Potter, ZsRE, WMDP, and MUSE, using models ranging from 1.3B to 13B in scale. The results demonstrate that PerMU delivers up to a 50.40% improvement in unlearning vanilla target data while maintaining a 40.73% boost in forgetting implicit knowledge. Our code can be found in https://github.com/MaybeLizzy/UGBench.
中文摘要:本文提出PerMU这一基于概率扰动的新型遗忘方法,通过模拟对抗性遗忘样本来消除显性目标数据及相关隐性知识,从而显著提升大语言模型的广义知识遗忘能力。
English Summary: This paper introduces PerMU, a novel probability perturbation-based unlearning method designed to enhance generalized knowledge forgetting in large language models by eliminating both explicit target data and related implicit knowledge through adversarial sample simulation.
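Code sketch: the logit-perturbation core in miniature — suppress every answer-associated token before renormalizing. The token ids and penalty magnitude are assumptions; PerMU additionally simulates adversarial unlearning samples to identify which tokens to target.

    import torch

    def suppress_answer_tokens(logits, answer_token_ids, penalty=5.0):
        """logits: (vocab,) next-token logits -> distribution with facts pushed down."""
        perturbed = logits.clone()
        perturbed[answer_token_ids] -= penalty   # lower all answer-associated tokens
        return torch.softmax(perturbed, dim=-1)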

Authors:Xiang Geng, Zhejian Lai, Jiajun Chen, Hao Yang, Shujian Huang
Title: Alleviating Distribution Shift in Synthetic Data for Machine Translation Quality Estimation
Abstract:
Quality Estimation (QE) models evaluate the quality of machine translations without reference translations, serving as the reward models for the translation task. Due to data scarcity, synthetic data generation has emerged as a promising solution. However, synthetic QE data often suffers from distribution shift, which can manifest as discrepancies between pseudo and real translations, or in pseudo labels that do not align with human preferences. To tackle this issue, we introduce DCSQE, a novel framework for alleviating distribution shift in synthetic QE data. To reduce the difference between pseudo and real translations, we employ the constrained beam search algorithm and enhance translation diversity through the use of distinct generation models. DCSQE uses references, i.e., translation supervision signals, to guide both the generation and annotation processes, enhancing the quality of token-level labels. DCSQE further identifies the shortest phrase covering consecutive error tokens, mimicking human annotation behavior, to assign the final phrase-level labels. Notably, we underscore that a translation model cannot accurately annotate its own translations. Extensive experiments demonstrate that DCSQE outperforms SOTA baselines like CometKiwi in both supervised and unsupervised settings. Further analysis offers insights into synthetic data generation that could benefit reward models for other tasks. The code is available at https://github.com/NJUNLP/njuqe.
中文: DCSQE是一种新颖框架,通过采用约束束搜索、增强翻译多样性以及利用参考译文指导生成和标注过程,有效缓解合成质量估计数据中的分布偏移问题,从而提升词级和短语级标签的质量。
English: DCSQE is a novel framework designed to mitigate distribution shift in synthetic Quality Estimation data by employing constrained beam search, enhancing translation diversity, and using references to guide generation and annotation, ultimately improving token- and phrase-level label quality.

Authors:Zhenyu Liu, Yunxin Li, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang
Title: Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents
Abstract:
To improve Multimodal Large Language Models' (MLLMs) ability to process images and complex instructions, researchers predominantly curate large-scale visual instruction tuning datasets, which are either sourced from existing vision tasks or synthetically generated using LLMs and image descriptions. However, they often suffer from critical flaws, including misaligned instruction-image pairs and low-quality images. Such issues hinder training efficiency and limit performance improvements, as models waste resources on noisy or irrelevant data with minimal benefit to overall capability. To address this issue, we propose a Visual-Centric Selection approach via Agents Collaboration (ViSA), which centers on image quality assessment and image-instruction relevance evaluation. Specifically, our approach consists of 1) an image information quantification method via visual agents collaboration to select images with rich visual information, and 2) a visual-centric instruction quality assessment method to select high-quality instruction data related to high-quality images. Finally, we reorganize 80K instruction data from large open-source datasets. Extensive experiments demonstrate that ViSA outperforms or is comparable to current state-of-the-art models on seven benchmarks, using only 2.5% of the original data, highlighting the efficiency of our data selection approach. Moreover, we conduct ablation studies to validate the effectiveness of each component of our method. The code is available at https://github.com/HITsz-TMG/ViSA.
中文摘要:研究者提出ViSA方法,通过视觉代理协作筛选高质量图像和相关指令来优化多模态大语言模型,仅用2.5%的数据即在七个基准测试中达到最优性能。
English Summary: Researchers propose ViSA, a visual-centric data selection method using agent collaboration to enhance MLLMs by filtering high-quality images and relevant instructions, achieving state-of-the-art performance with only 2.5% of data across seven benchmarks.

Authors:Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xiangyu Zhang
Title: Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Abstract:
Ensuring AI safety is crucial as large language models become increasingly integrated into real-world applications. A key challenge is jailbreak, where adversarial prompts bypass built-in safeguards to elicit harmful disallowed outputs. Inspired by psychological foot-in-the-door principles, we introduce FITD, a novel multi-turn jailbreak method that leverages the phenomenon where minor initial commitments lower resistance to more significant or more unethical transgressions. Our approach progressively escalates the malicious intent of user queries through intermediate bridge prompts and uses the model's own responses to align it toward producing toxic outputs. Extensive experimental results on two jailbreak benchmarks demonstrate that FITD achieves an average attack success rate of 94% across seven widely used models, outperforming existing state-of-the-art methods. Additionally, we provide an in-depth analysis of LLM self-corruption, highlighting vulnerabilities in current alignment strategies and emphasizing the risks inherent in multi-turn interactions. The code is available at https://github.com/Jinxiaolong1129/Foot-in-the-door-Jailbreak.
中文摘要:FITD是一种受心理学原理启发的新型多轮越狱方法,通过逐步升级恶意查询来绕过AI安全防护,实现了94%的攻击成功率,揭示了当前对齐策略中的漏洞。
English Summary: FITD is a novel multi-turn jailbreak method inspired by psychological principles that progressively escalates malicious queries to bypass AI safeguards, achieving a 94% attack success rate and exposing vulnerabilities in current alignment strategies.

Authors:Jinhao Pan, Chahat Raj, Ziyu Yao, Ziwei Zhu
Title: What's Not Said Still Hurts: A Description-Based Evaluation Framework for Measuring Social Bias in LLMs
Abstract:
Large Language Models (LLMs) often exhibit social biases inherited from their training data. While existing benchmarks evaluate bias in a term-based mode, through direct associations between demographic terms and bias terms, LLMs have become increasingly adept at avoiding biased responses, leading to seemingly low levels of bias. However, biases persist in subtler, contextually hidden forms that traditional benchmarks fail to capture. We introduce the Description-based Bias Benchmark (DBB), a novel dataset designed to assess bias at the semantic level, where bias concepts are hidden within naturalistic, subtly framed real-world contexts rather than surfacing as superficial terms. We analyze six state-of-the-art LLMs, revealing that while models reduce bias in their responses at the term level, they continue to reinforce biases in nuanced settings. Data, code, and results are available at https://github.com/JP-25/Description-based-Bias-Benchmark.
Chinese: 描述性偏见基准(DBB)是一种新颖的数据集,旨在通过自然语境中隐藏的语义层面评估大型语言模型的偏见,发现尽管模型在表层术语上减少了偏见,但在微妙情境中仍持续强化偏见。
English: The Description-based Bias Benchmark (DBB) is a new dataset that uncovers subtle, contextually embedded biases in Large Language Models, which traditional term-based evaluations miss, revealing that models still reinforce biases in nuanced scenarios despite appearing less biased superficially.

Authors:Hannah Cyberey, Yangfeng Ji, David Evans
Title: Unsupervised Concept Vector Extraction for Bias Control in LLMs
Abstract:
Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate these biases, but most work studies biases as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of "gender" is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model's representation. We develop a projection-based method that enables precise steering of model predictions and demonstrate its effectiveness in mitigating gender bias in LLMs and show that it also generalizes to racial bias. Our code is available at: https://github.com/hannahxchen/gender-bias-steering
中文: 研究人员基于表征工程开发了一种投影方法,无需标注数据即可精确测量和操控大语言模型中的性别与种族偏见,有效缓解了刻板印象问题。
English: Researchers developed a projection-based method using representation engineering to precisely measure and manipulate gender and racial bias in large language models, effectively mitigating stereotypes without requiring labeled data.
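Code sketch: the projection-based steering step in its usual form (consistent with, though not copied from, the paper): remove or shift the component of a hidden state along a unit concept vector.

    import torch

    def project_out(h, v):
        """Remove the concept direction v from hidden state(s) h of shape (..., d)."""
        v = v / v.norm()
        return h - (h @ v)[..., None] * v

    def steer(h, v, alpha):
        """Shift h along the concept direction; negative alpha suppresses the concept."""
        v = v / v.norm()
        return h + alpha * v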

Authors:Dayu Yang, Tianyang Liu, Daoan Zhang, Antoine Simoulin, Xiaoyi Liu, Yuwei Cao, Zhaopu Teng, Xin Qian, Grey Yang, Jiebo Luo, Julian McAuley
Title: Code to Think, Think to Code: A Survey on Code-Enhanced Reasoning and Reasoning-Driven Code Intelligence in LLMs
Abstract:
In large language models (LLMs), code and reasoning reinforce each other: code offers an abstract, modular, and logic-driven structure that supports reasoning, while reasoning translates high-level goals into smaller, executable steps that drive more advanced code intelligence. In this study, we examine how code serves as a structured medium for enhancing reasoning: it provides verifiable execution paths, enforces logical decomposition, and enables runtime validation. We also explore how improvements in reasoning have transformed code intelligence from basic completion to advanced capabilities, enabling models to address complex software engineering tasks through planning and debugging. Finally, we identify key challenges and propose future research directions to strengthen this synergy, ultimately improving LLM's performance in both areas.
中文摘要:代码与推理在大语言模型中相互促进,代码提供结构化逻辑框架,推理通过规划和调试实现高级代码智能,未来研究将致力于强化这种协同作用以提升模型性能。
English Summary: Code and reasoning mutually enhance each other in large language models, with code providing structured logical frameworks and reasoning enabling complex code intelligence through planning and debugging, while future research aims to strengthen this synergy.

Authors:Danae Sánchez Villegas, Ingo Ziegler, Desmond Elliott
Title: ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Abstract:
Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task -- achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.
中文: ImageChain通过将图像序列建模为多轮对话,增强了多模态大语言模型的顺序推理能力,在下一场景描述等任务中显著提升性能并实现强大的跨领域泛化。
English: ImageChain enhances multimodal large language models by modeling image sequences as multi-turn conversations, significantly improving sequential reasoning and achieving robust performance in tasks like next-scene description.

Authors:Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li
Title: Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
Abstract:
Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).
中文: 本文提出代理奖励建模方法,将人类偏好奖励与可验证的正确性信号(如事实性和指令遵循)相结合,为大型语言模型提供更可靠的奖励系统,实验证明其在各项基准测试和实际任务中显著优于传统奖励模型。
English: This paper introduces agentic reward modeling, which integrates human preference rewards with verifiable correctness signals like factuality and instruction following to create more reliable reward systems for large language models, demonstrating superior performance over traditional methods in experiments and downstream tasks.
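Code sketch: the reward combination at its simplest, with rm_score, check_facts, and check_instructions as assumed callables returning scalars; the fixed weighted sum is an illustrative assumption, since the actual RewardAgent orchestrates these signals agentically.

    def agentic_reward(prompt, response, rm_score, check_facts, check_instructions,
                       w_pref=1.0, w_fact=1.0, w_instr=1.0):
        pref = rm_score(prompt, response)             # human-preference reward
        fact = check_facts(response)                  # e.g., fraction of verified claims
        instr = check_instructions(prompt, response)  # e.g., constraints met, in [0, 1]
        return w_pref * pref + w_fact * fact + w_instr * instr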

Authors:Honglin Guo, Kai Lv, Qipeng Guo, Tianyi Liang, Zhiheng Xi, Demin Song, Qiuyinzhe Zhang, Yu Sun, Kai Chen, Xipeng Qiu, Tao Gui
Title: CritiQ: Mining Data Quality Criteria from Human Preferences
Abstract:
Language models heavily depend on high-quality data for optimal performance. Existing approaches rely on manually designed heuristics, the perplexity of existing models, training classifiers, or careful prompt engineering, which require significant expert experience and human annotation effort while introducing biases. We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ~30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments. We build a knowledge base that extracts quality criteria from previous work to boost CritiQ Flow. Compared to perplexity- and classifier-based methods, verbal criteria are more interpretable and possess reusable value. After deriving the criteria, we train the CritiQ Scorer to give quality scores and perform efficient data selection. We demonstrate the effectiveness of our method in the code, math, and logic domains, achieving high accuracy on human-annotated test sets. To validate the quality of the selected data, we continually train Llama 3.1 models and observe improved performance on downstream tasks compared to uniform sampling. Ablation studies validate the benefits of the knowledge base and the reflection process. We analyze how criteria evolve and the effectiveness of majority voting.
中文:CritiQ是一种新颖的数据选择方法,仅需少量人工标注即可自动从人类偏好中提取质量标准并高效筛选数据,在代码、数学和逻辑领域优于现有方法,同时提升模型性能。
English: CritiQ is a novel data selection method that automatically derives quality criteria from minimal human feedback and efficiently selects high-quality data, outperforming existing approaches in code, math, and logic domains while improving model performance.

Authors:Zhouyu Jiang, Mengshu Sun, Zhiqiang Zhang, Lei Liang
Title: Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation
Abstract:
Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection due to its implementation simplicity, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show our 14B model outperforms baseline models with over five times larger parameter scales and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at https://github.com/OpenSPG/KAG.
Chinese: Bi'an框架通过提供双语基准数据集和轻量级评判模型,解决了RAG幻觉检测中的局限性,其140亿参数模型在性能上超越了规模更大的基线模型,并能与顶尖闭源大语言模型相媲美。
English: The Bi'an framework addresses limitations in RAG hallucination detection by providing a bilingual benchmark dataset and lightweight judge models, with its 14B parameter model outperforming larger baselines and competing with top closed-source LLMs.

Authors:Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran, Quoc V. Le, Orhan Firat
Title: BIG-Bench Extra Hard
Abstract:
Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and a diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench and on its harder version, BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
Chinese: 针对现有推理基准如BIG-Bench Hard的性能饱和问题,新推出的BIG-Bench Extra Hard(BBEH)基准通过引入更具挑战性的任务,揭示了当前大语言模型在通用推理能力上的显著差距,凸显了持续改进的必要性。
English: To address the saturation of existing reasoning benchmarks like BIG-Bench Hard, the new BIG-Bench Extra Hard (BBEH) benchmark introduces more challenging tasks, revealing significant performance gaps and underscoring the ongoing need for improved general reasoning in large language models.
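
As a side note on the reporting: a harmonic average over per-task accuracies is dominated by the weakest tasks, which is presumably why BBEH headline scores look so low. A quick illustration with made-up task scores:

# Harmonic vs. arithmetic mean over hypothetical per-task accuracies (in %).
# These four scores are invented for illustration; a single weak task drags
# the harmonic mean far below the arithmetic mean.
scores = [5.0, 12.0, 30.0, 60.0]

arithmetic = sum(scores) / len(scores)
harmonic = len(scores) / sum(1.0 / s for s in scores)

print(f"arithmetic mean: {arithmetic:.1f}%")  # 26.8%
print(f"harmonic mean:   {harmonic:.1f}%")    # 12.0%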

Authors:Henry Peng Zou, Zhengyao Gu, Yue Zhou, Yankai Chen, Weizhi Zhang, Liancheng Fang, Yibo Wang, Yangning Li, Kay Liu, Philip S. Yu
Title: TestNUC: Enhancing Test-Time Computing Approaches and Scaling through Neighboring Unlabeled Data Consistency
Abstract:
Test-time computing approaches, which leverage additional computational resources during inference, have been proven effective in enhancing large language model performance. This work introduces a novel, linearly scaling approach, TestNUC, that improves test-time predictions by leveraging the local consistency of neighboring unlabeled data-it classifies an input instance by considering not only the model's prediction on that instance but also on neighboring unlabeled instances. We evaluate TestNUC across eight diverse datasets, spanning intent classification, topic mining, domain discovery, and emotion detection, demonstrating its consistent superiority over baseline methods such as standard prompting and self-consistency. Furthermore, TestNUC can be seamlessly integrated with existing test-time computing approaches, substantially boosting their performance. Our analysis reveals that TestNUC scales effectively with increasing amounts of unlabeled data and performs robustly across different embedding models, making it practical for real-world applications. Our code is available at https://github.com/HenryPengZou/TestNUC.
中文: TestNUC是一种创新的测试时计算方法,通过利用相邻未标记数据的局部一致性来提高预测精度,在多个数据集上表现优于基线方法,并能与现有方法无缝集成。
English: TestNUC is a novel test-time computing method that enhances prediction accuracy by leveraging the local consistency of neighboring unlabeled data, demonstrating consistent superiority across diverse datasets and seamless integration with existing approaches.
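
A minimal sketch of the neighbor-consistency idea, assuming precomputed embeddings and per-instance model predictions are already available; the function and variable names are hypothetical, not from the TestNUC codebase, and the real method weights and scales these votes more carefully:

import numpy as np
from collections import Counter

def testnuc_predict(query_emb, query_pred, pool_embs, pool_preds, k=5):
    """Refine a model's prediction on one instance by majority vote over
    the model's predictions on its k nearest unlabeled neighbors."""
    # Cosine similarity between the query and every unlabeled pool instance.
    sims = pool_embs @ query_emb / (
        np.linalg.norm(pool_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    neighbors = np.argsort(-sims)[:k]
    votes = Counter(pool_preds[i] for i in neighbors)
    votes[query_pred] += 1  # the instance's own prediction also votes
    return votes.most_common(1)[0][0]

# Toy usage with random embeddings and string labels.
rng = np.random.default_rng(0)
pool_embs = rng.normal(size=(100, 32))
pool_preds = rng.choice(["billing", "refund", "shipping"], size=100)
query = rng.normal(size=32)
print(testnuc_predict(query, "refund", pool_embs, pool_preds))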

Authors:Zhaowei Zhang, Fengshuo Bai, Qizhi Chen, Chengdong Ma, Mingzhi Wang, Haoran Sun, Zilong Zheng, Yaodong Yang
Title: Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs
Abstract:
How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse regarding culture, values, or time. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduce Amulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users' personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.

Authors:Michelle Kappl
Title: Are All Spanish Doctors Male? Evaluating Gender Bias in German Machine Translation
Abstract:
We present WinoMTDE, a new gender bias evaluation test set designed to assess occupational stereotyping and underrepresentation in German machine translation (MT) systems. Building on the automatic evaluation method introduced by arXiv:1906.00591v1, we extend the approach to German, a language with grammatical gender. The WinoMTDE dataset comprises 288 German sentences balanced with regard to both gender and stereotype, the latter annotated using German labor statistics. We conduct a large-scale evaluation of five widely used MT systems and a large language model. Our results reveal persistent bias in most models, with the LLM outperforming traditional systems. The dataset and evaluation code are publicly available at https://github.com/michellekappl/mt_gender_german.
中文摘要:WinoMTDE是一个用于评估德语机器翻译系统中职业性别偏见的数据集,研究发现大多数模型存在持续偏见,其中大型语言模型表现最优。
English Summary: WinoMTDE is a German gender bias evaluation dataset designed to test occupational stereotyping in machine translation systems, revealing persistent biases in most models with large language models performing best.

Authors:Siwei Wu, Yizhi Li, Xingwei Qu, Rishi Ravikumar, Yucheng Li, Tyler Loakman, Shanghaoran Quan, Xiaoyong Wei, Riza Batista-Navarro, Chenghua Lin
Title: LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm
Abstract:
Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, yet their ability to generate long-form content remains poorly understood and evaluated. Our analysis reveals that current LLMs struggle with length requirements and information density in long-text generation, with performance deteriorating as text length increases. To quantitatively locate such a performance degradation and provide further insights on model development, we present LongEval, a benchmark that evaluates long-text generation through both direct and plan-based generation paradigms, inspired by cognitive and linguistic writing models. The comprehensive experiments in this work reveal interesting findings, such as that while model size correlates with generation ability, small-scale models well-trained on long texts (e.g., LongWriter) can achieve comparable performance. All code and datasets are released at https://github.com/Wusiwei0410/LongEval.
中文: 大型语言模型在生成长文本方面存在困难,为此我们开发了LongEval基准,通过不同生成方法评估性能,发现经过长文本训练的小型模型也能达到大型模型的水平。
English: Large Language Models struggle with long-text generation, leading to the creation of LongEval, a benchmark that evaluates performance across different generation methods and reveals that smaller models trained on long texts can match larger ones.

Authors:Yiheng Yang, Yujie Wang, Chi Ma, Lei Yu, Emmanuele Chersoni, Chu-Ren Huang
Title: Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs
Abstract:
Dense large language models (LLMs) face critical efficiency bottlenecks as they rigidly activate all parameters regardless of input complexity. While existing sparsity methods (static pruning or dynamic activation) address this partially, they either lack adaptivity to contextual or model-structural demands or incur prohibitive computational overhead. Inspired by the human brain's dual-process mechanisms - predictive coding (N400) for backbone sparsity and structural reanalysis (P600) for complex context - we propose CLADA, a Cognitive-Load-Aware Dynamic Activation framework that synergizes statistical sparsity with semantic adaptability. Our key insight is that LLM activations exhibit two complementary patterns: 1) global statistical sparsity driven by sequence-level prefix information, and 2) local semantic adaptability modulated by cognitive-load metrics (e.g., surprisal and entropy). CLADA employs a hierarchical thresholding strategy: a baseline from offline error-controlled optimization ensures 40%+ sparsity, dynamically adjusted by real-time cognitive signals. Evaluations across six mainstream LLMs and nine benchmarks demonstrate that CLADA achieves ~20% average speedup with <2% accuracy drop, outperforming Griffin (5%+ degradation) and TT (negligible speedup). Crucially, we establish the first formal connection between neurolinguistic event-related potential (ERP) components and LLM efficiency mechanisms through multi-level regression analysis (R^2 = 0.17 for sparsity-adaptation synergy). Requiring no retraining or architectural changes, CLADA offers a deployable solution for resource-aware LLM inference while advancing biologically inspired AI design. Our code is available at https://github.com/Oldify/CLADA.
中文摘要:CLADA是一种认知负载感知的动态激活框架,通过结合统计稀疏性与语义适应性来提升大语言模型效率,在实现约20%加速的同时保持精度损失低于2%,并建立了与大脑神经语言机制的首次正式关联。
English Summary: CLADA is a cognitive-load-aware dynamic activation framework that enhances LLM efficiency by combining statistical sparsity with semantic adaptability, achieving ~20% speedup with minimal accuracy loss while establishing a neurolinguistic connection to brain mechanisms.
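
A rough sketch of what hierarchical, cognitive-load-aware thresholding could look like. The constants and the exact load signal below are illustrative assumptions, not the paper's calibrated values:

import numpy as np

def clada_threshold(logits, base_tau=0.4, alpha=0.1):
    """Start from a baseline activation threshold (offline-calibrated for
    ~40%+ sparsity) and relax it when the next-token distribution signals
    high cognitive load. base_tau and alpha are illustrative constants."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    surprisal = -np.log(probs.max() + 1e-12)       # load of the top prediction
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    load = surprisal + entropy
    # High load -> lower threshold -> more neurons kept active.
    return max(0.0, base_tau - alpha * load)

def sparse_activate(hidden, tau):
    """Zero out activations whose magnitude falls below the tau-quantile."""
    cutoff = np.quantile(np.abs(hidden), tau)
    return np.where(np.abs(hidden) >= cutoff, hidden, 0.0)

rng = np.random.default_rng(1)
logits = rng.normal(size=50)
hidden = rng.normal(size=1024)
tau = clada_threshold(logits)
print(f"threshold {tau:.3f}, sparsity "
      f"{(sparse_activate(hidden, tau) == 0).mean():.2%}")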

Authors:Ujjwal Singh, Aditi Sharma, Nikhil Gupta, Deepakshi, Vivek Kumar Jha
Title: IndicEval-XL: Bridging Linguistic Diversity in Code Generation Across Indic Languages
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation from natural language prompts, revolutionizing software development workflows. As we advance towards agent-based development paradigms, these models form the cornerstone of next-generation software development lifecycles. However, current benchmarks for evaluating multilingual code generation capabilities are predominantly English-centric, limiting their applicability across the global developer community. To address this limitation, we present IndicEval-XL, a comprehensive benchmark for code generation that incorporates 6 major Indic languages, collectively spoken by approximately 14% of the world's population. Our benchmark bridges these languages with 12 programming languages, creating a robust evaluation framework. This work is particularly significant given India's representation of one-eighth of the global population and the crucial role Indic languages play in Indian society. IndicEval-XL represents a significant step toward expanding the linguistic diversity in code generation systems and evaluation frameworks. By developing resources that support multiple languages, we aim to make AI-powered development tools more inclusive and accessible to developers of various linguistic backgrounds. To facilitate further research and development in this direction, we make our dataset and evaluation benchmark publicly available at https://github.com/telekom/IndicEval-XL
中文:IndicEval-XL 是一个涵盖六种主要印度语言与十二种编程语言的综合性多语言代码生成基准,通过公开数据集解决了当前以英语为中心的限制,推动了人工智能开发工具的包容性发展。
English: IndicEval-XL is a comprehensive multilingual code generation benchmark that integrates six major Indic languages with twelve programming languages, addressing the current English-centric limitations and promoting inclusivity in AI development tools by making the dataset publicly available.

Authors:Hao Liang, Meiyi Qiang, Yuying Li, Zefeng He, Yongzhen Guo, Zhengzhou Zhu, Wentao Zhang, Bin Cui
Title: MathClean: A Benchmark for Synthetic Mathematical Data Cleaning
Abstract:
With the rapid development of large language models (LLMs), the quality of training data has become crucial. Among the various types of training data, mathematical data plays a key role in enabling LLMs to acquire strong reasoning abilities. While high-quality open-source data is important, it is often insufficient for pre-training, necessitating the addition of synthetic math problems. However, synthetic math questions and answers can introduce inaccuracies, which may degrade both the training data and web data. Therefore, an effective method for cleaning synthetic math data is essential. In this paper, we propose the MathClean benchmark to evaluate the effectiveness of math data cleaning models. The MathClean benchmark consists of 2,000 correct questions and 2,000 erroneous questions, with an additional 2,000 correct and erroneous answers sourced from augmented data based on GSM8K and MATH. Moreover, we also annotate error types for each question or answer, since this allows assessing whether models can correctly identify the error categories for future improvements. Finally, we present comprehensive evaluations using state-of-the-art (SOTA) models. Our results demonstrate that even strong models like GPT-o1 and DeepSeek-R1 perform poorly on this benchmark, highlighting the utility of MathClean. Our code and data are available at https://github.com/YuYingLi0/MathClean.
中文: MathClean基准被提出用于评估数学数据清洗模型,包含4000个标注的问题和答案以解决合成训练数据中的错误,测试表明即使是GPT-o1和DeepSeek-R1等先进模型在此任务上也表现不佳。
English: The MathClean benchmark is introduced to evaluate math data cleaning models, comprising 4,000 annotated questions and answers to address inaccuracies in synthetic training data, with tests showing even advanced models like GPT-o1 and DeepSeek-R1 struggle on this task.

Authors:Jakub Macina, Nico Daheim, Ido Hakimi, Manu Kapur, Iryna Gurevych, Mrinmaya Sachan
Title: MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
Abstract:
Evaluating the pedagogical capabilities of AI-based tutoring models is critical for making guided progress in the field. Yet, we lack a reliable, easy-to-use, and simple-to-run evaluation that reflects the pedagogical abilities of models. To fill this gap, we present MathTutorBench, an open-source benchmark for holistic tutoring model evaluation. MathTutorBench contains a collection of datasets and metrics that broadly cover tutor abilities as defined by learning sciences research in dialog-based teaching. To score the pedagogical quality of open-ended teacher responses, we train a reward model and show it can discriminate expert from novice teacher responses with high accuracy. We evaluate a wide set of closed- and open-weight models on MathTutorBench and find that subject expertise, indicated by solving ability, does not immediately translate to good teaching. Rather, pedagogy and subject expertise appear to form a trade-off that is navigated by the degree of tutoring specialization of the model. Furthermore, tutoring appears to become more challenging in longer dialogs, where simpler questioning strategies begin to fail. We release the benchmark, code, and leaderboard openly to enable rapid benchmarking of future models.

Authors:Shuyi Liu, Simiao Cui, Haoran Bu, Yuming Shang, Xi Zhang
Title: JailBench: A Comprehensive Chinese Security Assessment Benchmark for Large Language Models
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities across various applications, highlighting the urgent need for comprehensive safety evaluations. In particular, the enhanced Chinese language proficiency of LLMs, combined with the unique characteristics and complexity of Chinese expressions, has driven the emergence of Chinese-specific benchmarks for safety assessment. However, these benchmarks generally fall short in effectively exposing LLM safety vulnerabilities. To address the gap, we introduce JailBench, the first comprehensive Chinese benchmark for evaluating deep-seated vulnerabilities in LLMs, featuring a refined hierarchical safety taxonomy tailored to the Chinese context. To improve generation efficiency, we employ a novel Automatic Jailbreak Prompt Engineer (AJPE) framework for JailBench construction, which incorporates jailbreak techniques to enhance assessment effectiveness and leverages LLMs to automatically scale up the dataset through in-context learning. The proposed JailBench is extensively evaluated over 13 mainstream LLMs and achieves the highest attack success rate against ChatGPT compared to existing Chinese benchmarks, underscoring its efficacy in identifying latent vulnerabilities in LLMs, as well as illustrating the substantial room for improvement in the security and trustworthiness of LLMs within the Chinese context. Our benchmark is publicly available at https://github.com/STAIR-BUPT/JailBench.
中文摘要:JailBench是首个针对大语言模型深层漏洞的中文综合评测基准,采用自动越狱提示工程框架,相比现有基准在ChatGPT上实现了更高的攻击成功率。
English Summary: JailBench is introduced as the first comprehensive Chinese benchmark designed to evaluate deep-seated vulnerabilities in large language models, employing an automatic jailbreak prompt engineering framework to achieve higher attack success rates than existing benchmarks.

Authors:Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
Title: TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
Abstract:
Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
Chinese: TOKENSWIFT 是一种创新框架,通过消除频繁的模型重载、优化动态键值管理和减少重复生成,解决了超长序列生成中的关键瓶颈,在保持输出质量的同时,在不同规模和架构的模型上实现了超过3倍的加速效果。
English: TOKENSWIFT is a novel framework that addresses key bottlenecks in ultra-long sequence generation by eliminating frequent model reloading, optimizing KV management, and reducing repetitive generation, achieving over 3x speedup across various model scales and architectures while preserving output quality.

Authors:Tianyun Liu
Title: Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding
Abstract:
Traditional text-to-speech (TTS) methods primarily focus on establishing a mapping between phonemes and mel-spectrograms. However, during the phoneme encoding stage, there is often a lack of real mel-spectrogram auxiliary information, which results in the encoding process lacking true semantic understanding. At the same time, traditional TTS systems often struggle to balance the inference speed of the model with the quality of the synthesized speech. Methods that generate high-quality synthesized speech tend to have slower inference speeds, while faster inference methods often sacrifice speech quality. In this paper, I propose Clip-TTS, a TTS method based on the Clip architecture. This method uses the Clip framework to establish a connection between text content and real mel-spectrograms during the text encoding stage, enabling the text encoder to directly learn the true semantics of the global context, thereby ensuring the quality of the synthesized speech. In terms of model architecture, I adopt the basic structure of Transformer, which allows Clip-TTS to achieve fast inference speeds. Experimental results show that on the LJSpeech and Baker datasets, the speech generated by Clip-TTS achieves state-of-the-art MOS scores, and it also performs excellently on multi-emotion datasets. Audio samples are available at: https://ltydd1314.github.io/.

Authors:Zichuan Fu, Wentao Song, Yejing Wang, Xian Wu, Yefeng Zheng, Yingying Zhang, Derong Xu, Xuetao Wei, Tong Xu, Xiangyu Zhao
Title: Sliding Window Attention Training for Efficient Large Language Models
Abstract:
Recent advances in transformer-based Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks. However, their quadratic computational complexity concerning sequence length remains a significant bottleneck for processing long documents. As a result, many efforts like sparse attention and state space models have been proposed to improve the efficiency of LLMs over long sequences. Though effective, these approaches compromise the performance or introduce structural complexity. This calls for a simple yet efficient model that preserves the fundamental Transformer architecture. To this end, we introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training. This paper first attributes the inefficiency of Transformers to the attention sink phenomenon resulting from the high variance of softmax operation. Then, we replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention. Experiments demonstrate that SWAT achieves SOTA performance compared with state-of-the-art linear recurrent architectures on eight benchmarks. Code is available at https://github.com/Fzkuji/swat-attention.
Chinese Summary: 本文提出SWAT模型,通过用sigmoid替换softmax并结合平衡位置嵌入,有效提升了Transformer处理长文本的能力,在多个基准测试中实现了最优性能。
English Summary: The paper introduces SWAT, a model that enhances long-context processing in Transformers by replacing softmax with sigmoid and using balanced position embeddings, achieving state-of-the-art efficiency and performance.
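
A toy sketch of the core attention change, under the assumption that SWAT combines a causal sliding window, sigmoid scoring in place of softmax, and an ALiBi-style linear distance bias; the paper's balanced ALiBi/RoPE scheme is more involved, and the constants here are illustrative:

import numpy as np

def swat_attention(q, k, v, window=4, slope=0.5):
    """Sliding-window attention with sigmoid scoring: softmax is replaced
    by an elementwise sigmoid (no row normalization, hence no attention
    sink) plus a linear distance penalty. Shapes: q, k, v are (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    pos = np.arange(T)
    dist = pos[:, None] - pos[None, :]
    scores = scores - slope * np.abs(dist)          # ALiBi-like bias
    mask = (dist >= 0) & (dist < window)            # causal sliding window
    weights = 1.0 / (1.0 + np.exp(-scores)) * mask  # sigmoid, no softmax
    return weights @ v

rng = np.random.default_rng(2)
T, d = 8, 16
out = swat_attention(rng.normal(size=(T, d)),
                     rng.normal(size=(T, d)),
                     rng.normal(size=(T, d)))
print(out.shape)  # (8, 16)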

Authors:Shuliang Liu, Xinze Li, Zhenghao Liu, Yukun Yan, Cheng Yang, Zheni Zeng, Zhiyuan Liu, Maosong Sun, Ge Yu
Title: Judge as A Judge: Improving the Evaluation of Retrieval-Augmented Generation through the Judge-Consistency of Large Language Models
Abstract:
Retrieval-Augmented Generation (RAG) has proven its effectiveness in alleviating hallucinations for Large Language Models (LLMs). However, existing automated evaluation metrics cannot fairly evaluate the outputs generated by RAG models during training and evaluation. LLM-based judgment models provide the potential to produce high-quality judgments, but they are highly sensitive to evaluation prompts, leading to inconsistencies when judging the output of RAG models. This paper introduces the Judge-Consistency (ConsJudge) method, which aims to enhance LLMs to generate more accurate evaluations for RAG models. Specifically, ConsJudge prompts LLMs to generate different judgments based on various combinations of judgment dimensions, utilize the judge-consistency to evaluate these judgments and select the accepted and rejected judgments for DPO training. Our experiments show that ConsJudge can effectively provide more accurate judgments for optimizing RAG models across various RAG models and datasets. Further analysis reveals that judgments generated by ConsJudge have a high agreement with the superior LLM. All codes are available at https://github.com/OpenBMB/ConsJudge.
中文: 本文提出Judge-Consistency方法,通过生成多维度判断并利用一致性筛选,有效提升大语言模型对RAG模型的评估准确性,在不同模型和数据集上均能优化RAG性能。
English: This paper introduces the Judge-Consistency (ConsJudge) method to enhance LLMs' evaluation accuracy for RAG models by generating diverse judgments and using consistency for selection, effectively improving RAG optimization across various models and datasets.
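
A runnable skeleton of the judge-consistency selection step, with the LLM judge stubbed out by a random vote so the sketch runs offline; prompt construction, the paper's actual dimension definitions, and the DPO training itself are omitted:

import random
from collections import Counter

DIMENSIONS = ["hallucination", "completeness", "coherence", "relevance"]

def judge_once(answer_a, answer_b, dims):
    """Stand-in for an LLM judge prompted with a subset of judgment
    dimensions; this stub ignores its inputs and votes at random."""
    return random.choice(["A", "B"])

def consjudge_pairs(answer_a, answer_b, n_rounds=5):
    """Collect judgments under varied dimension combinations, then take the
    majority verdict as the accepted judgment and a dissenting verdict as
    the rejected one; this is the preference pair fed to DPO training."""
    verdicts = [judge_once(answer_a, answer_b,
                           random.sample(DIMENSIONS, k=2))
                for _ in range(n_rounds)]
    majority, _ = Counter(verdicts).most_common(1)[0]
    dissent = next((v for v in verdicts if v != majority), None)
    return {"accepted": majority, "rejected": dissent, "all": verdicts}

random.seed(3)
print(consjudge_pairs("answer A text", "answer B text"))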

Authors:Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao
Title: Reward Shaping to Mitigate Reward Hacking in RLHF
Abstract:
Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to reward hacking, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. We evaluated PAR on two base models, Gemma2-2B and Llama3-8B, using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR's superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR. This work was done during Jiayi Fu's internship at StepFun.
Chinese: 本研究提出了一种新颖的奖励塑造方法PAR,通过利用奖励模型中的潜在偏好信号来有效缓解人类反馈强化学习中的奖励破解问题,在评估中展现出卓越的性能和鲁棒性。
English: This study introduces Preference As Reward (PAR), a novel reward shaping method that effectively mitigates reward hacking in reinforcement learning from human feedback by leveraging latent preferences, demonstrating superior performance and robustness in evaluations.
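
The abstract's two design principles (bounded reward; fast initial growth followed by convergence) are exactly the shape of a sigmoid centered on a reference score, so a hedged sketch of such shaping might look like the following; the numbers are invented:

import math

def par_shaped_reward(r, r_ref):
    """Sigmoid-centered reward shaping consistent with the two principles:
    bounded in (0, 1), steep near the reference reward, flattening beyond
    it. r_ref is a single reference reward for the prompt (the abstract
    notes one reference reward suffices)."""
    return 1.0 / (1.0 + math.exp(-(r - r_ref)))

r_ref = 0.8  # hypothetical reward-model score of a reference response
for r in [0.0, 0.8, 2.0, 6.0, 20.0]:
    print(f"raw {r:5.1f} -> shaped {par_shaped_reward(r, r_ref):.3f}")
# Ever-larger raw rewards gain almost nothing once past r_ref,
# blunting the incentive to hack the reward model.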

Authors:Siqi Guo, Ilgee Hong, Vicente Balmaseda, Changlong Yu, Liang Qiu, Xin Liu, Haoming Jiang, Tuo Zhao, Tianbao Yang
Title: Discriminative Finetuning of Generative Large Language Models without Reward Models and Human Preference Data
Abstract:
Supervised fine-tuning (SFT) has become a crucial step for aligning pretrained large language models (LLMs) using supervised datasets of input-output pairs. However, despite being supervised, SFT is inherently limited by its generative training objective. To address its limitations, the existing common strategy is to follow SFT with a separate phase of preference optimization (PO), which relies on either human-labeled preference data or a strong reward model to guide the learning process. In this paper, we address the limitations of SFT by exploring one of the most successful techniques in conventional supervised learning: discriminative learning. We introduce Discriminative Fine-Tuning (DFT), an improved variant of SFT, which mitigates the burden of collecting human-labeled preference data or training strong reward models. Unlike SFT, which employs a generative approach and overlooks negative data, DFT adopts a discriminative paradigm that increases the probability of positive answers while suppressing potentially negative ones, aiming for data prediction instead of token prediction. Our contributions include: (i) a discriminative probabilistic framework for fine-tuning LLMs by explicitly modeling the discriminative likelihood of an answer among all possible outputs given an input; (ii) efficient algorithms to optimize this discriminative likelihood; and (iii) extensive experiments demonstrating DFT's effectiveness, achieving performance better than SFT and comparable to, if not better than, SFT→PO. The code can be found at https://github.com/Optimization-AI/DFT.
Chinese: 本文提出了一种改进的监督微调方法——判别式微调(DFT),它采用判别式学习范式增强正面回答的概率并抑制负面回答,无需依赖偏好数据或奖励模型即可实现优于传统方法或与之相当的性能。
English: This paper introduces Discriminative Fine-Tuning (DFT), an enhanced version of supervised fine-tuning that uses discriminative learning to increase the likelihood of positive responses while reducing negative ones, achieving performance comparable to or better than traditional methods without requiring preference data or reward models.
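
A minimal sketch of a discriminative objective in this spirit: treat the sequence log-probabilities of one positive answer and several sampled negatives as logits, and maximize the positive's softmax share. The paper's estimator of the normalization over all possible outputs is more sophisticated, and the values below are illustrative:

import math

def dft_loss(logp_positive, logp_negatives):
    """Negative log of the positive answer's softmax share among a
    candidate set of (length-normalized) sequence log-probabilities."""
    logps = [logp_positive] + list(logp_negatives)
    m = max(logps)
    log_z = m + math.log(sum(math.exp(lp - m) for lp in logps))
    return -(logp_positive - log_z)

# Toy: positive answer slightly more likely than three sampled negatives.
print(round(dft_loss(-1.2, [-1.9, -2.3, -3.0]), 4))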

Authors:Silei Xu, Wenhao Xie, Lingxiao Zhao, Pengcheng He
Title: Chain of Draft: Thinking Faster by Writing Less
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance in solving complex reasoning tasks through mechanisms like Chain-of-Thought (CoT) prompting, which emphasizes verbose, step-by-step reasoning. However, humans typically employ a more efficient strategy: drafting concise intermediate thoughts that capture only essential information. In this work, we propose Chain of Draft (CoD), a novel paradigm inspired by human cognitive processes, where LLMs generate minimalistic yet informative intermediate reasoning outputs while solving tasks. By reducing verbosity and focusing on critical insights, CoD matches or surpasses CoT in accuracy while using as little as 7.6% of the tokens, significantly reducing cost and latency across various reasoning tasks. Our code and data are available at https://github.com/sileix/chain-of-draft.
中文总结:提出的思维草稿链(CoD)方法让大语言模型生成极简的中间推理步骤,仅用7.6%的词汇量即可达到或超越思维链的准确率,显著提升效率。
English Summary: The proposed Chain of Draft (CoD) method enables LLMs to produce minimal intermediate reasoning steps, achieving comparable or superior accuracy to Chain-of-Thought while using only 7.6% of tokens for greater efficiency.
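
In practice CoD is a prompting change; a sketch of such a prompt follows. The instruction wording below paraphrases the idea of minimal drafts and is not quoted from the paper:

# Chain-of-Draft-style prompt (instruction wording paraphrased, not the
# paper's exact text).
COD_SYSTEM = (
    "Think step by step, but keep only a minimal draft of each step, "
    "at most five words per step. End with the final answer after '####'."
)

question = "A jug holds 4 liters. How many jugs fill a 20-liter tank?"
prompt = f"{COD_SYSTEM}\n\nQ: {question}\nA:"
print(prompt)
# A CoD-style completion might read:
#   20 / 4 = 5. #### 5
# versus a verbose multi-sentence Chain-of-Thought trace.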

Authors:Anton Lavrouk, Tarek Naous, Alan Ritter, Wei Xu
Title: What are Foundation Models Cooking in the Post-Soviet World?
Abstract:
The culture of the Post-Soviet states is complex, shaped by a turbulent history that continues to influence current events. In this study, we investigate the Post-Soviet cultural food knowledge of foundation models by constructing BORSch, a multimodal dataset encompassing 1147 dishes in Russian and 823 in Ukrainian, centered on the Post-Soviet region. We demonstrate that leading models struggle to correctly identify the origins of dishes from Post-Soviet nations in both text-only and multimodal Question Answering (QA), instead over-predicting countries linked to the language the question is asked in. Through analysis of pretraining data, we show that these results can be explained by misleading dish-origin co-occurrences, along with linguistic phenomena such as Russian-Ukrainian code mixing. Finally, to move beyond QA-based assessments, we test models' abilities to produce accurate visual descriptions of dishes. The weak correlation between this task and QA suggests that QA alone may be insufficient as an evaluation of cultural understanding. To foster further research, we will make BORSch publicly available at https://github.com/alavrouk/BORSch.
中文: 本研究通过引入BORSch后苏联菜肴多模态数据集,揭示了基础模型因语言偏见常误判菜肴起源,并证明仅靠问答不足以评估文化理解能力。
English: This study introduces BORSch, a multimodal dataset of Post-Soviet dishes, revealing that foundation models often misattribute dish origins due to linguistic biases and demonstrates that question-answering alone is inadequate for evaluating cultural understanding.

Authors:Zhewei Kang, Xuandong Zhao, Dawn Song
Title: Scalable Best-of-N Selection for Large Language Models via Self-Certainty
Abstract:
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models (LLMs) through increased test-time computation. Current state-of-the-art methods often employ computationally intensive reward models for response evaluation and selection. Reward-free alternatives, like self-consistency and universal self-consistency, are limited in their ability to handle open-ended generation tasks or scale effectively. To address these limitations, we propose self-certainty, a novel and efficient metric that leverages the inherent probability distribution of LLM outputs to estimate response quality without requiring external reward models. We hypothesize that higher distributional self-certainty, aggregated across multiple samples, correlates with improved response accuracy, as it reflects greater confidence in the generated output. Through extensive experiments on various reasoning tasks, we demonstrate that self-certainty (1) scales effectively with increasing sample size N, akin to reward models but without the computational overhead; (2) complements chain-of-thought, improving reasoning performance beyond greedy decoding; and (3) generalizes to open-ended tasks where traditional self-consistency methods fall short. Our findings establish self-certainty as a practical and efficient way of improving LLM reasoning capabilities. The code is available at https://github.com/backprop07/Self-Certainty
中文: 本文提出“自确定性”这一新指标,利用大语言模型内部概率分布来高效评估回答质量,无需外部奖励模型,实验证明其在推理任务中比现有方法具有更好的可扩展性和性能表现。
English: The paper introduces "self-certainty," a novel metric that uses LLMs' internal probability distributions to efficiently evaluate response quality without external rewards, demonstrating improved scalability and performance across reasoning tasks compared to existing methods.
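
A sketch of reward-free best-of-N with a distribution-based confidence score, here instantiated as the average per-token KL divergence from the uniform distribution. This is one plausible reading of the metric, and all names are hypothetical:

import numpy as np

def self_certainty(token_logprob_rows):
    """Average per-token certainty of one sampled response; each row holds
    the full-vocabulary log-probs at one generation step."""
    per_token = []
    for logp in token_logprob_rows:
        p = np.exp(logp)
        v = len(logp)
        kl_from_uniform = np.sum(p * (logp + np.log(v)))
        per_token.append(kl_from_uniform)
    return float(np.mean(per_token))

def best_of_n(responses):
    """Pick the response whose token distributions were most self-certain."""
    return max(responses, key=lambda r: self_certainty(r["logprobs"]))

# Toy: two 'responses' over a 4-token vocabulary, 3 generated tokens each.
rng = np.random.default_rng(4)
def fake_logprobs(temp):
    logits = rng.normal(size=(3, 4)) / temp
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
cands = [{"text": "confident", "logprobs": fake_logprobs(0.3)},
         {"text": "hedged",    "logprobs": fake_logprobs(3.0)}]
print(best_of_n(cands)["text"])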

Authors:Ivoline Ngong, Swanand Kadhe, Hao Wang, Keerthiram Murugesan, Justin D. Weisz, Amit Dhurandhar, Karthikeyan Natesan Ramamurthy
Title: Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents
Abstract:
Conversational agents are increasingly woven into individuals' personal lives, yet users often underestimate the privacy risks associated with them. The moment users share information with these agents - such as large language models (LLMs) - their private information becomes vulnerable to exposure. In this paper, we characterize the notion of contextual privacy for user interactions with LLM-based Conversational Agents (LCAs). It aims to minimize privacy risks by ensuring that users (senders) disclose only information that is both relevant and necessary for achieving their intended goals when interacting with LCAs (untrusted receivers). Through a formative design user study, we observe how even "privacy-conscious" users inadvertently reveal sensitive information through indirect disclosures. Based on insights from this study, we propose a locally deployable framework that operates between users and LCAs, identifying and reformulating out-of-context information in user prompts. Our evaluation using examples from ShareGPT shows that lightweight models can effectively implement this framework, achieving strong gains in contextual privacy while preserving the user's intended interaction goals. Notably, about 76% of participants in our human evaluation preferred the reformulated prompts over the original ones, validating the usability and effectiveness of contextual privacy in our proposed framework. We open-source the code at https://github.com/IBM/contextual-privacy-LLM.
中文: 本文提出了面向大语言模型对话代理的情境隐私概念,通过一个本地部署框架有效重构用户提示以最小化非必要信息泄露,在保持交互目标的同时获得76%参与者对隐私增强版本的选择偏好。
English: This paper introduces contextual privacy for LLM-based conversational agents, proposing a local framework that effectively reformulates user prompts to minimize unnecessary information disclosure while preserving interaction goals, with 76% of participants preferring the privacy-enhanced versions.

Authors:Aman Goel, Xian Carrie Wu, Zhe Wang, Dmitriy Bespalov, Yanjun Qi
Title: TurboFuzzLLM: Turbocharging Mutation-based Fuzzing for Effectively Jailbreaking Large Language Models in Practice
Abstract:
Jailbreaking large language models (LLMs) involves testing their robustness against adversarial prompts and evaluating their ability to withstand prompt attacks that could elicit unauthorized or malicious responses. In this paper, we present TurboFuzzLLM, a mutation-based fuzzing technique for efficiently finding a collection of effective jailbreaking templates that, when combined with harmful questions, can lead a target LLM to produce harmful responses through black-box access via user prompts. We describe the limitations of directly applying existing template-based attacking techniques in practice, and present functional and efficiency-focused upgrades we added to mutation-based fuzzing to generate effective jailbreaking templates automatically. TurboFuzzLLM achieves ≥95% attack success rates (ASR) on public datasets for leading LLMs (including GPT-4o & GPT-4 Turbo), shows impressive generalizability to unseen harmful questions, and helps in improving model defenses to prompt attacks. TurboFuzzLLM is available open source at https://github.com/amazon-science/TurboFuzzLLM.
中文摘要:TurboFuzzLLM是一种基于变异的模糊测试技术,能自动生成有效的越狱模板来测试大语言模型的鲁棒性,对GPT-4o等领先模型的攻击成功率超过95%,同时有助于增强模型防御能力。
English Summary: TurboFuzzLLM is a mutation-based fuzzing technique that automatically generates effective jailbreaking templates to test LLM robustness, achieving over 95% attack success rates against models like GPT-4o while helping improve their defenses.
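
The overall loop is standard mutation-based fuzzing; a skeleton with the mutation operators and target model stubbed out so it runs offline. The real system's operator set, template selection policy, and early-stopping heuristics are not reproduced here:

import random

MUTATIONS = ["expand", "rephrase", "inject_role", "few_shot", "condense"]

def mutate(template, op):
    """Stub mutation operator; real operators rewrite the template with an
    LLM (e.g., adding role-play framing or few-shot structure)."""
    return f"[{op}] " + template

def target_llm_refuses(prompt):
    """Stub for black-box access to the target model; returns True when
    the (hypothetical) model refuses. Random here for illustration."""
    return random.random() < 0.7

def fuzz(seed_templates, harmful_question, budget=20):
    """Minimal mutate-query-keep loop: repeatedly mutate templates from
    the pool and keep those that elicit a non-refusal; successful
    templates seed further mutations."""
    pool, successes = list(seed_templates), []
    for _ in range(budget):
        template = random.choice(pool)
        candidate = mutate(template, random.choice(MUTATIONS))
        if not target_llm_refuses(candidate.replace("[QUESTION]",
                                                    harmful_question)):
            successes.append(candidate)
            pool.append(candidate)
    return successes

random.seed(5)
print(len(fuzz(["Please answer: [QUESTION]"], "<redacted test question>")))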

Authors:Yizhe Zhang, Richard Bai, Zijin Gu, Ruixiang Zhang, Jiatao Gu, Emmanuel Abbe, Samy Bengio, Navdeep Jaitly
Title: What Makes the Preferred Thinking Direction for LLMs in Multiple-choice Questions?
Abstract:
Language models usually use left-to-right (L2R) autoregressive factorization. However, L2R factorization may not always be the best inductive bias. Therefore, we investigate whether alternative factorizations of the text distribution could be beneficial in some tasks. We investigate right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across various model sizes (2B-8B parameters) and training datasets, we find that R2L models can significantly outperform L2R models on several MCQ benchmarks, including logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors including calibration, computability, and directional conditional entropy. We analyze the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled. Our work demonstrates that exploring alternative factorizations of the text distribution can lead to improvements in LLM capabilities, provides theoretical insights into optimal factorization toward approximating the human language distribution, and indicates when each reasoning order might be more advantageous. Our code and checkpoints are released at https://github.com/apple/ml-reversal-blessing.
中文摘要:本研究表明,在多项选择题推理任务中,从右向左训练的语言模型优于传统的从左向右模型,揭示了性能与校准度和条件熵的关联,并为文本分布的最佳因子化提供了理论依据。
English Summary: This study demonstrates that right-to-left (R2L) trained language models outperform traditional left-to-right (L2R) models on multiple-choice reasoning tasks, revealing performance links to calibration and conditional entropy while providing theoretical insights into optimal text factorization.
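
Mechanically, R2L training is a data-preparation change: reverse every training sequence and train an ordinary autoregressive model on the result. A toy illustration:

def to_r2l(token_ids):
    """Right-to-left factorization via data preparation: reverse each
    training sequence so a standard left-to-right model learns
    p(x_T) * p(x_{T-1} | x_T) * ... instead of the usual order."""
    return list(reversed(token_ids))

# Toy demonstration with word-level 'tokens'.
sentence = "the capital of France is Paris".split()
print(to_r2l(sentence))
# ['Paris', 'is', 'France', 'of', 'capital', 'the']
# At MCQ inference, one could score each choice by the R2L likelihood of
# the reversed (answer + question) sequence and pick the argmax.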

Authors:Henry Peng Zou, Siffi Singh, Yi Nian, Jianfeng He, Jason Cai, Saab Mansour, Hang Su
Title: GLEAN: Generalized Category Discovery with Diverse and Quality-Enhanced LLM Feedback
Abstract:
Generalized Category Discovery (GCD) is a practical and challenging open-world task that aims to recognize both known and novel categories in unlabeled data using limited labeled data from known categories. Due to the lack of supervision, previous GCD methods face significant challenges, such as difficulty in rectifying errors for confusing instances, and inability to effectively uncover and leverage the semantic meanings of discovered clusters. Therefore, additional annotations are usually required for real-world applicability. However, human annotation is extremely costly and inefficient. To address these issues, we propose GLEAN, a unified framework for generalized category discovery that actively learns from diverse and quality-enhanced LLM feedback. Our approach leverages three different types of LLM feedback to: (1) improve instance-level contrastive features, (2) generate category descriptions, and (3) align uncertain instances with LLM-selected category descriptions. Extensive experiments demonstrate the superior performance of GLEAN over state-of-the-art models across diverse datasets, metrics, and supervision settings. Our code is available at https://github.com/amazon-science/Glean.
Chinese: GLEAN是一个统一框架,通过主动学习多样化和质量增强的大型语言模型反馈,解决在未标记数据中识别已知和未知类别的挑战。
English: GLEAN is a unified framework for generalized category discovery that actively learns from diverse and quality-enhanced LLM feedback to address challenges in recognizing both known and novel categories in unlabeled data.

Authors:Ahmed Elhady, Eneko Agirre, Mikel Artetxe
Title: WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
Abstract:
We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We release our code and data at https://github.com/ahmedselhady/wicked-benchmarks.
中文: WiCkeD通过引入“以上都不是”选项来增强多项选择题库的复杂性,显著降低了模型表现,并揭示了不同大语言模型在推理能力上的差异敏感性。
English: WiCkeD enhances the complexity of multiple-choice benchmarks by adding "None of the above" options, significantly reducing model performance and revealing varied reasoning sensitivities across LLMs.
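
The transformation itself is a few lines; a sketch of applying it to one item (the field names are hypothetical, not from the WiCkeD codebase):

import random

def wicked(question, choices, answer_idx, rng=random):
    """Replace one randomly chosen option with 'None of the above'. If the
    replaced option happens to be the gold answer, the gold answer becomes
    'None of the above' (same index, new text); otherwise it is untouched."""
    i = rng.randrange(len(choices))
    new_choices = list(choices)
    new_choices[i] = "None of the above"
    return question, new_choices, answer_idx

random.seed(6)
q, c, a = wicked("What is the capital of France?",
                 ["Paris", "Lyon", "Nice", "Lille"], answer_idx=0)
print(c, "gold:", c[a])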

Authors:Jianhao Yan, Yun Luo, Yue Zhang
Title: RefuteBench 2.0 -- Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction
Abstract:
In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM's ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, allowing for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter generates more human-like refutations and that the evaluators assign scores highly correlated with human judgments. Experimental results of various LLMs show that current models can effectively satisfy the refutation but fail to memorize the refutation information. Interestingly, we also observe that performance on the initial task decreases as the refutations increase. Analysis of the attention scores further shows a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long-context dialogues. The code is available at https://github.com/ElliottYan/RefuteBench-2.0.
中文摘要:RefuteBench 2.0通过引入LLM代理作为反驳者和评估者,系统评估语言模型整合用户反馈的能力,发现现有模型虽能应对反驳,但在长对话中难以有效保持和运用这些信息。
English Summary: RefuteBench 2.0 introduces LLM agents as refuters and evaluators to assess how well language models incorporate user feedback, revealing that while current models address refutations, they struggle with retaining this information during extended dialogues.

Authors:Gaye Colakoglu, Gürkan Solmaz, Jonathan Fürst
Title: Problem Solved? Information Extraction Design Space for Layout-Rich Documents using LLMs
Abstract:
This paper defines and explores the design space for information extraction (IE) from layout-rich documents using large language models (LLMs). The three core challenges of layout-aware IE with LLMs are 1) data structuring, 2) model engagement, and 3) output refinement. Our study investigates the sub-problems and methods within these core challenges, such as input representation, chunking, prompting, selection of LLMs, and multimodal models. It examines the effect of different design choices through LayIE-LLM, a new, open-source, layout-aware IE test suite, benchmarking against traditional, fine-tuned IE models. The results on two IE datasets show that LLMs require adjustment of the IE pipeline to achieve competitive performance: the optimized configuration found with LayIE-LLM achieves 13.3–37.5 F1 points more than a general-practice baseline configuration using the same LLM. To find a well-working configuration, we develop a one-factor-at-a-time (OFAT) method that achieves near-optimal results. Our method is only 0.8–1.8 points lower than the best full factorial exploration with a fraction (2.8%) of the required computation. Overall, we demonstrate that, if well-configured, general-purpose LLMs match the performance of specialized models, providing a cost-effective, finetuning-free alternative. Our test suite is available at https://github.com/gayecolakoglu/LayIE-LLM.
中文: 本研究探讨了利用大型语言模型进行布局感知信息提取的设计空间,通过测试套件证明优化配置无需微调即可媲美专用模型的性能。
English: This study explores the design space for layout-aware information extraction using large language models, introducing a test suite that demonstrates optimized configurations can match specialized model performance without fine-tuning.
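
A self-contained sketch of one-factor-at-a-time search over a hypothetical factor grid, with the expensive pipeline evaluation replaced by a toy scorer. The factor names are illustrative, not LayIE-LLM's actual configuration keys:

import itertools

FACTORS = {
    "input_repr": ["plain_text", "markdown", "xml_boxes"],
    "chunking":   ["none", "page", "region"],
    "prompting":  ["zero_shot", "few_shot"],
}

def evaluate(config):
    """Stub standing in for running the IE pipeline and measuring F1;
    deterministic toy scores so the sketch runs offline."""
    score = {"xml_boxes": 3, "markdown": 2, "plain_text": 1}[config["input_repr"]]
    score += {"region": 2, "page": 1, "none": 0}[config["chunking"]]
    score += {"few_shot": 1, "zero_shot": 0}[config["prompting"]]
    return score

def ofat_search(factors, evaluate):
    """One-factor-at-a-time: start from a baseline, then optimize each
    factor in turn while holding the others fixed. Explores the sum of the
    level counts instead of their full product."""
    config = {name: levels[0] for name, levels in factors.items()}
    for name, levels in factors.items():
        config[name] = max(levels, key=lambda lv: evaluate({**config, name: lv}))
    return config

best = ofat_search(FACTORS, evaluate)
full = max((dict(zip(FACTORS, combo))
            for combo in itertools.product(*FACTORS.values())),
           key=evaluate)
print("OFAT:", best, "\nfull factorial:", full)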

Authors:Laura Perez-Beltrachini, Mirella Lapata
Title: Uncertainty Quantification in Retrieval Augmented Question Answering
Abstract:
Retrieval augmented Question Answering (QA) helps QA models overcome knowledge gaps by incorporating retrieved evidence, typically a set of passages, alongside the question at test time. Previous studies show that this approach improves QA performance and reduces hallucinations, without, however, assessing whether the retrieved passages are indeed useful for answering correctly. In this work, we propose to quantify the uncertainty of a QA model by estimating the utility of the passages it is provided with. We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information theoretic metrics can predict answer correctness up to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods. Code and data are available at https://github.com/lauhaide/ragu.
中文摘要:本研究提出了一种通过预测检索段落效用来量化检索增强问答中不确定性的方法,证明轻量级神经网络模型能有效评估答案正确性,并达到或超越昂贵采样方法的性能。
English Summary: This research introduces a method to quantify uncertainty in retrieval-augmented question answering by predicting the utility of retrieved passages, demonstrating that a lightweight neural model effectively estimates answer correctness and matches or surpasses costly sampling-based approaches.
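
A toy sketch of the lightweight-utility-predictor idea: train a small classifier to predict, from features of a (question, passage) pair, whether the QA model answers correctly, then read its probability as an uncertainty estimate. The features and data below are synthetic stand-ins, not the paper's representation:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical feature columns: retriever score, lexical overlap, length.
rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 3))
# Synthetic labels: utility mostly driven by the first two features.
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=n) > 0).astype(int)

utility_model = LogisticRegression().fit(X[:150], y[:150])
print(f"held-out accuracy: {utility_model.score(X[150:], y[150:]):.2f}")

# At test time, the predicted utility of the provided passages serves as
# an (un)certainty estimate for the QA model's answer on that question.
probs = utility_model.predict_proba(X[150:155])[:, 1]
print("predicted passage utilities:", np.round(probs, 2))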

Authors:Cao Yuxuan, Wu Jiayang, Alistair Cheong Liang Chuen, Bryan Shan Guanrong, Theodore Lee Chong Jen, Sherman Chann Zhi Shen
Title: Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models
Abstract:
Traditional online content moderation systems struggle to classify modern multimodal means of communication, such as memes, a highly nuanced and information-dense medium. This task is especially hard in a culturally diverse society like Singapore, where low-resource languages are used and extensive knowledge on local context is needed to interpret online content. We curate a large collection of 112K memes labeled by GPT-4V for fine-tuning a VLM to classify offensive memes in the Singapore context. We show the effectiveness of fine-tuned VLMs on our dataset, and propose a pipeline containing OCR, translation and a 7-billion parameter-class VLM. Our solutions reach 80.62% accuracy and 0.8192 AUROC on a held-out test set, and can greatly aid humans in moderating online content. The dataset, code, and model weights have been open-sourced at https://github.com/aliencaocao/vlm-for-memes-aisg.
中文摘要:传统内容审核系统难以处理如表情包这类多模态内容,尤其是在文化多元的新加坡,但通过在大规模数据集上微调视觉语言模型,对冒犯性表情包的分类准确率达到了80.62%。
English Summary: Traditional content moderation systems are ineffective for nuanced multimodal content like memes, especially in culturally diverse Singapore, but fine-tuning a vision-language model on a large dataset achieves 80.62% accuracy in classifying offensive memes.

Authors:Zhijun Chen, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Ming Li, Likang Xiao, Dingqi Yang, Yikun Ban, Hailong Sun, Philip S. Yu
Title: Harnessing Multiple Large Language Models: A Survey on LLM Ensemble
Abstract:
LLM Ensemble - which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during downstream inference, to benefit from their individual strengths - has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of the methods under the broad categories of "ensemble-before-inference, ensemble-during-inference, ensemble-after-inference", and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at https://github.com/junchenzhi/Awesome-LLM-Ensemble.
中文: 本文首次系统综述了大语言模型集成方法,将其分类为推理前、推理中和推理后集成,并探讨了相关基准、应用及未来研究方向。
English: This paper provides the first systematic review of LLM Ensemble, categorizing methods into ensemble-before, during, and after-inference, and discusses benchmarks, applications, and future research directions.

Authors:Zhuo Chen, Xinyu Wang, Yong Jiang, Zhen Zhang, Xinyu Geng, Pengjun Xie, Fei Huang, Kewei Tu
Title: Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference
Abstract:
Despite the advancements made in Vision Large Language Models (VLLMs), like text Large Language Models (LLMs), they have limitations in addressing questions that require real-time information or are knowledge-intensive. Indiscriminately adopting Retrieval Augmented Generation (RAG) techniques is an effective yet expensive way to enable models to answer queries beyond their knowledge scopes. To mitigate the dependence on retrieval and simultaneously maintain, or even improve, the performance benefits provided by retrieval, we propose a method to detect the knowledge boundary of VLLMs, allowing for more efficient use of techniques like RAG. Specifically, we propose a method with two variants that fine-tune a VLLM on an automatically constructed dataset for boundary identification. Experimental results on various types of Visual Question Answering datasets show that our method successfully depicts a VLLM's knowledge boundary, based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance. In addition, we show that the knowledge boundary identified by our method for one VLLM can be used as a surrogate boundary for other VLLMs. Code will be released at https://github.com/Chord-Chen-30/VLLM-KnowledgeBoundary
Chinese Summary: 本研究提出了一种识别视觉大语言模型知识边界的方法,通过选择性使用检索技术减少不必要的检索,同时在多种视觉问答任务中保持或提升性能。
English Summary: This study introduces a method to identify the knowledge boundaries of Vision Large Language Models (VLLMs), enabling selective use of retrieval techniques to reduce unnecessary retrievals while maintaining or enhancing performance across various Visual Question Answering tasks.
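
The sampling-based intuition can be sketched in a few lines: sample several answers at nonzero temperature and treat high agreement as evidence that the question lies inside the model's knowledge boundary. The paper goes further and distills such signals into a fine-tuned boundary classifier rather than probing at test time; the sampler below is a stub:

import random
from collections import Counter

def sample_answers(question, n=8):
    """Stub for sampling n answers from a VLLM at nonzero temperature;
    random here so the sketch runs without a model."""
    return [random.choice(["Paris", "Paris", "Lyon"]) for _ in range(n)]

def within_knowledge_boundary(question, threshold=0.75, n=8):
    """If the model's sampled answers mostly agree, treat the question as
    inside its knowledge boundary and skip retrieval; otherwise fall back
    to RAG."""
    answers = sample_answers(question, n)
    _, top_count = Counter(answers).most_common(1)[0]
    return top_count / n >= threshold

random.seed(8)
q = "What is the capital of France?"
print("skip retrieval" if within_knowledge_boundary(q) else "use RAG")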

Authors:Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao
Title: ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Abstract:
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark. The code is available at https://github.com/Alibaba-NLP/ViDoRAG.
中文摘要:本文提出ViDoRAG多智能体框架,通过混合检索策略和迭代推理工作流解决现有RAG方法在处理视觉文档时的不足,在ViDoSeek基准测试中性能提升超过10%。
English Summary: The abstract introduces ViDoRAG, a multi-agent framework that addresses limitations in current RAG methods for visually rich documents by employing hybrid retrieval and iterative reasoning workflows, achieving over 10% improvement on the new ViDoSeek benchmark.

Authors:Xinghao Chen, Zhijing Sun, Wenjin Guo, Miaoran Zhang, Yanjun Chen, Yirong Sun, Hui Su, Yijie Pan, Dietrich Klakow, Wenjie Li, Xiaoyu Shen
Title: Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning
Abstract:
Large Language Models (LLMs) excel in reasoning tasks through Chain-of-Thought (CoT) prompting. However, CoT prompting greatly increases computational demands, which has prompted growing interest in distilling CoT capabilities into Small Language Models (SLMs). This study systematically examines the factors influencing CoT distillation, including the choice of granularity, format and teacher model. Through experiments involving four teacher models and seven student models across seven mathematical and commonsense reasoning datasets, we uncover three key findings: (1) Unlike LLMs, SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision; (2) CoT format significantly impacts LLMs but has minimal effect on SLMs, likely due to their reliance on supervised fine-tuning rather than pretraining preferences; (3) Stronger teacher models do NOT always produce better student models, as diversity and complexity in CoT supervision can outweigh accuracy alone. These findings emphasize the need to tailor CoT strategies to specific student model, offering actionable insights for optimizing CoT distillation in SLMs. The code and datasets are available at https://github.com/EIT-NLP/Distilling-CoT-Reasoning.
中文摘要:本研究表明,针对小语言模型的思维链能力蒸馏需要定制化策略,因为小模型对推理粒度、格式和教师模型的选择响应方式与大模型不同,且更强的教师模型未必产生更好的学生模型。
English Summary: This study reveals that effective Chain-of-Thought distillation for Small Language Models requires tailored strategies, as SLMs respond differently than LLMs to granularity, format, and teacher model selection, with stronger teachers not always yielding better results.

Authors:Tianmi Ma, Jiawei Du, Wenxin Huang, Wenjie Wang, Liang Xie, Xian Zhong, Joey Tianyi Zhou
Title: Agent Trading Arena: A Study on Numerical Understanding in LLM-Based Agents
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in natural language tasks, yet their performance in dynamic, real-world financial environments remains underexplored. Existing approaches are limited to historical backtesting, where trading actions cannot influence market prices and agents train only on static data. To address this limitation, we present the Agent Trading Arena, a virtual zero-sum stock market in which LLM-based agents engage in competitive multi-agent trading and directly impact price dynamics. By simulating realistic bid-ask interactions, our platform enables training in scenarios that closely mirror live markets, thereby narrowing the gap between training and evaluation. Experiments reveal that LLMs struggle with numerical reasoning when given plain-text data, often overfitting to local patterns and recent values. In contrast, chart-based visualizations significantly enhance both numerical reasoning and trading performance. Furthermore, incorporating a reflection module yields additional improvements, especially with visual inputs. Evaluations on NASDAQ and CSI datasets demonstrate the superiority of our method, particularly under high volatility. All code and data are available at https://github.com/wekjsdvnm/Agent-Trading-Arena.
中文: 大语言模型在实时金融交易中因数值推理能力不足和过拟合问题表现欠佳,而代理交易竞技场通过引入可视化输入和反思模块的竞争性模拟,显著提升了交易性能,尤其在市场波动剧烈时效果更为突出。
English: Large language models struggle with real-time financial trading due to limitations in numerical reasoning and overfitting, but the Agent Trading Arena introduces a competitive simulation with visual inputs and reflection modules that significantly enhance performance, especially in volatile markets.
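
As a rough illustration of the zero-sum, price-impacting setup described above, here is a minimal bid-ask crossing step. The order fields, midpoint execution rule, and agent names are illustrative assumptions; the actual arena simulates far richer dynamics.

```python
# Sketch: one matching step of a toy zero-sum market (simplified; not the
# paper's implementation).
from dataclasses import dataclass

@dataclass
class Order:
    agent: str
    side: str    # "bid" or "ask"
    price: float
    qty: int

def match_once(orders: list[Order]) -> tuple[float, int] | None:
    """Cross the best bid with the best ask; return (trade price, qty) or None."""
    bids = sorted((o for o in orders if o.side == "bid"), key=lambda o: -o.price)
    asks = sorted((o for o in orders if o.side == "ask"), key=lambda o: o.price)
    if not bids or not asks or bids[0].price < asks[0].price:
        return None  # no crossing orders, no trade this round
    qty = min(bids[0].qty, asks[0].qty)
    price = 0.5 * (bids[0].price + asks[0].price)  # midpoint execution
    return price, qty

book = [Order("llm_a", "bid", 101.0, 10), Order("llm_b", "ask", 100.0, 5)]
print(match_once(book))  # (100.5, 5): llm_a's gain is llm_b's loss (zero-sum)
```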

Authors:Qianying Liu, Katrina Qiyao Wang, Fei Cheng, Sadao Kurohashi
Title: Assessing Agentic Large Language Models in Multilingual National Bias
Abstract:
Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on the risks associated with cross-lingual biases have been limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLMs' applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal that local language bias is prevalent across different tasks, with GPT-4 and Sonnet reducing bias for English-speaking countries compared to GPT-3.5 but failing to achieve robust multilingual alignment, highlighting broader implications for multilingual AI agents and applications such as education. Code available at: https://github.com/yiyunya/assess_agentic_national_bias
中文摘要:本研究揭示大型语言模型在跨语言决策任务中存在普遍的本土语言偏好,尽管新版模型有所改进,但仍无法实现稳健的多语言对齐,这对多语言AI应用具有重要影响。
English Summary: This study investigates multilingual bias in large language models, revealing persistent local language preferences across decision-making tasks despite some improvements in newer models, which fail to achieve robust cross-language alignment.

Authors:Haitao Li, Jiaying Ye, Yiran Hu, Jia Chen, Qingyao Ai, Yueyue Wu, Junjie Chen, Yifan Chen, Cheng Luo, Quan Zhou, Yiqun Liu
Title: CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation
Abstract:
Legal case documents play a critical role in judicial proceedings. As the number of cases continues to rise, the reliance on manual drafting of legal case documents is facing increasing pressure and challenges. The development of large language models (LLMs) offers a promising solution for automating document generation. However, existing benchmarks fail to fully capture the complexities involved in drafting legal case documents in real-world scenarios. To address this gap, we introduce CaseGen, a benchmark for multi-stage legal case document generation in the Chinese legal domain. CaseGen is based on 500 real case samples annotated by legal experts and covers seven essential case sections. It supports four key tasks: drafting defense statements, writing trial facts, composing legal reasoning, and generating judgment results. To the best of our knowledge, CaseGen is the first benchmark designed to evaluate LLMs in the context of legal case document generation. To ensure an accurate and comprehensive evaluation, we design an LLM-as-a-judge evaluation framework and validate its effectiveness through human annotations. We evaluate several widely used general-domain LLMs and legal-specific LLMs, highlighting their limitations in case document generation and pinpointing areas for potential improvement. This work marks a step toward a more effective framework for automating legal case document drafting, paving the way for the reliable application of AI in the legal field. The dataset and code are publicly available at https://github.com/CSHaitao/CaseGen.
中文摘要:本文提出了CaseGen,这是首个针对中文法律领域多阶段案件文书生成的基准测试,通过专家标注的真实案例和新颖的评估框架填补现有基准的不足,为AI在法律文书自动生成的可靠应用铺平道路。
English Summary: This paper introduces CaseGen, the first benchmark for evaluating large language models in multi-stage legal case document generation within the Chinese legal system, addressing gaps in existing benchmarks through expert-annotated real cases and a novel evaluation framework.

Authors:Shiping Gao, Fanqi Wan, Jiajian Guo, Xiaojun Quan, Qifan Wang
Title: Advantage-Guided Distillation for Preference Alignment in Small Language Models
Abstract:
Alignment techniques enable Large Language Models (LLMs) to generate outputs that align with human preferences and play a crucial role in their effectiveness. However, their impact often diminishes when applied to Small Language Models (SLMs), likely due to the limited capacity of these models. Instead of directly applying existing alignment techniques to SLMs, we propose to utilize a well-aligned teacher LLM to guide the alignment process for these models, thereby facilitating the transfer of the teacher's knowledge of human preferences to the student model. To achieve this, we first explore a straightforward approach, Dual-Constrained Knowledge Distillation (DCKD), that employs knowledge distillation with two KL-divergence constraints from the aligned teacher to the unaligned student. To further enhance the student's ability to distinguish between preferred and dispreferred responses, we then propose Advantage-Guided Distillation for Preference Alignment (ADPA), which leverages an advantage function from the aligned teacher to deliver more nuanced, distribution-level reward signals for the student's alignment. Our experimental results show that these two approaches appreciably improve the alignment of SLMs and narrow the performance gap with larger counterparts. Among them, ADPA demonstrates superior performance and achieves even greater effectiveness when integrated with DCKD. Our code is available at https://github.com/SLIT-AI/ADPA.
中文: 由于小型语言模型能力有限,大语言模型的对齐技术对其效果不佳,因此我们提出DCKD和ADPA两种知识蒸馏方法,利用对齐良好的教师大模型向学生模型传递人类偏好知识,显著提升了小型模型的对齐效果并缩小了与大型模型的性能差距。
English: Alignment techniques for Large Language Models often fail with Small Language Models due to their limited capacity, so we propose two knowledge distillation methods—DCKD and ADPA—that use a well-aligned teacher LLM to transfer human preference knowledge to SLMs, significantly improving their alignment and narrowing the performance gap with larger models.
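
A sketch of how DCKD's two KL-divergence constraints might look at the loss level: the student is pulled toward the aligned teacher on both the preferred (w) and dispreferred (l) responses. The shapes, the weighting beta, and the sequence-level granularity are simplifying assumptions, not the paper's exact formulation.

```python
# Sketch: dual-constrained distillation as two KL(teacher || student) terms,
# one on the chosen response and one on the rejected one (simplified reading).
import torch
import torch.nn.functional as F

def dckd_loss(student_logits_w, teacher_logits_w,
              student_logits_l, teacher_logits_l,
              beta: float = 1.0) -> torch.Tensor:
    kl_w = F.kl_div(F.log_softmax(student_logits_w, -1),
                    F.softmax(teacher_logits_w, -1), reduction="batchmean")
    kl_l = F.kl_div(F.log_softmax(student_logits_l, -1),
                    F.softmax(teacher_logits_l, -1), reduction="batchmean")
    return kl_w + beta * kl_l  # beta balances the two constraints

# Toy shapes: (batch, vocab).
s_w, t_w = torch.randn(4, 32000), torch.randn(4, 32000)
s_l, t_l = torch.randn(4, 32000), torch.randn(4, 32000)
print(dckd_loss(s_w, t_w, s_l, t_l).item())
```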

Authors:Mingyan Wu, Zhenghao Liu, Yukun Yan, Xinze Li, Shi Yu, Zheni Zeng, Yu Gu, Ge Yu
Title: RankCoT: Refining Knowledge for Retrieval-Augmented Generation through Ranking Chain-of-Thoughts
Abstract:
Retrieval-Augmented Generation (RAG) enhances the performance of Large Language Models (LLMs) by incorporating external knowledge. However, LLMs still encounter challenges in effectively utilizing the knowledge from retrieved documents, often being misled by irrelevant or noisy information. To address this issue, we introduce RankCoT, a knowledge refinement method that incorporates reranking signals when generating CoT-based summarization for knowledge refinement, based on the given query and all retrieved documents. During training, RankCoT prompts the LLM to generate Chain-of-Thought (CoT) candidates based on the query and individual documents. It then fine-tunes the LLM to directly reproduce the best CoT from these candidate outputs based on all retrieved documents, which requires the LLM to filter out irrelevant documents while generating the CoT-style summarization. Additionally, RankCoT incorporates a self-reflection mechanism that further refines the CoT outputs, resulting in higher-quality training data. Our experiments demonstrate the effectiveness of RankCoT, showing its superior performance over other knowledge refinement models. Further analysis reveals that RankCoT can provide shorter but effective refinement results, enabling the generator to produce more accurate answers. All code and data are available at https://github.com/NEUIR/RankCoT.
中文: RankCoT是一种知识精炼方法,通过重排序和自反思机制生成思维链摘要,有效过滤无关文档以提升大语言模型答案的准确性。
English: RankCoT is a knowledge refinement method that enhances Large Language Models by generating Chain-of-Thought summaries through reranking and self-reflection, effectively filtering irrelevant documents to produce more accurate answers.

Authors:Hannah Calzi Kleidermacher, James Zou
Title: Science Across Languages: Assessing LLM Multilingual Translation of Scientific Papers
Abstract:
Scientific research is inherently global. However, the vast majority of academic journals are published exclusively in English, creating barriers for non-native-English-speaking researchers. In this study, we leverage large language models (LLMs) to translate published scientific articles while preserving their native JATS XML formatting, thereby developing a practical, automated approach for implementation by academic journals. Using our approach, we translate articles across multiple scientific disciplines into 28 languages. To evaluate translation accuracy, we introduce a novel question-and-answer (QA) benchmarking method, in which an LLM generates comprehension-based questions from the original text and then answers them based on the translated text. Our benchmark results show an average performance of 95.9%, showing that the key scientific details are accurately conveyed. In a user study, we translate the scientific papers of 15 researchers into their native languages, finding that the authors consistently found the translations to accurately capture the original information in their articles. Interestingly, a third of the authors found many technical terms "overtranslated," expressing a preference to keep terminology more familiar in English untranslated. Finally, we demonstrate how in-context learning techniques can be used to align translations with domain-specific preferences such as mitigating overtranslation, highlighting the adaptability and utility of LLM-driven scientific translation. The code and translated articles are available at https://hankleid.github.io/ProjectMundo.

Authors:Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang
Title: LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
Abstract:
Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making assumptions, backtracking, and self-refinement. However, effectively evaluating such reflection capabilities remains challenging due to the lack of appropriate benchmarks. To bridge this gap, we introduce LR$^2$Bench, a novel benchmark designed to evaluate the Long-chain Reflective Reasoning capabilities of LLMs. LR$^2$Bench comprises 850 samples across six Constraint Satisfaction Problems (CSPs) where reflective reasoning is crucial for deriving solutions that meet all given constraints. Each type of task focuses on distinct constraint patterns, such as knowledge-based, logical, and spatial constraints, providing a comprehensive evaluation of diverse problem-solving scenarios. Our extensive evaluation on both conventional LLMs and LRMs reveals that even the most advanced LRMs, such as DeepSeek-R1 and OpenAI o1-preview, struggle with tasks in LR$^2$Bench, achieving average Exact Match scores of only 20.0% and 23.6%, respectively. These findings underscore the significant room for improvement in the reflective reasoning capabilities of current LLMs.
Chinese: 大型推理模型(LRMs)虽提升了LLMs的推理能力,但评估其反思能力仍具挑战,为此推出的LR²Bench基准测试显示,即使顶尖LRMs如DeepSeek-R1和OpenAI o1-preview也表现不佳,平均准确率仅20.0%和23.6%,表明当前模型反思推理能力亟待提升。
English: Large Reasoning Models (LRMs) have advanced reasoning in LLMs, but evaluating their reflection capabilities remains challenging, prompting the introduction of LR²Bench, a benchmark that reveals even top LRMs like DeepSeek-R1 and OpenAI o1-preview struggle with only 20.0% and 23.6% average scores, highlighting significant room for improvement.

Authors:Xiongxiao Xu, Haoran Wang, Yueqing Liang, Philip S. Yu, Yue Zhao, Kai Shu
Title: Can Multimodal LLMs Perform Time Series Anomaly Detection?
Abstract:
Large language models (LLMs) have been increasingly used in time series analysis. However, the potential of multimodal LLMs (MLLMs), particularly vision-language models, for time series remains largely under-explored. One natural way for humans to detect time series anomalies is through visualization and textual description. Motivated by this, we raise a critical and practical research question: Can multimodal LLMs perform time series anomaly detection? To answer this, we propose the VisualTimeAnomaly benchmark to evaluate MLLMs in time series anomaly detection (TSAD). Our approach transforms numerical time series data into images and feeds these images into various MLLMs, including proprietary models (GPT-4o and Gemini-1.5) and open-source models (LLaVA-NeXT and Qwen2-VL), each with one larger and one smaller variant. In total, VisualTimeAnomaly contains 12.4k time series images spanning 3 scenarios and 3 anomaly granularities with 9 anomaly types across 8 MLLMs. Starting with the univariate case (point- and range-wise anomalies), we extend our evaluation to more practical scenarios, including multivariate and irregular time series scenarios, and variate-wise anomalies. Our study reveals several key insights: 1) MLLMs detect range- and variate-wise anomalies more effectively than point-wise anomalies. 2) MLLMs are highly robust to irregular time series, even with 25% of the data missing. 3) Open-source MLLMs perform comparably to proprietary models in TSAD. While open-source MLLMs excel on univariate time series, proprietary MLLMs demonstrate superior effectiveness on multivariate time series. To the best of our knowledge, this is the first work to comprehensively investigate MLLMs for TSAD, particularly for multivariate and irregular time series scenarios. We release our dataset and code at https://github.com/mllm-ts/VisualTimeAnomaly to support future research.
Chinese: 本研究提出VisualTimeAnomaly基准,通过将时间序列数值数据转换为图像来评估多模态大语言模型在异常检测中的表现,发现模型能有效检测范围和变量级异常、对不规则数据具有强鲁棒性,且开源模型在单变量场景下与商业模型性能相当。
English: This study introduces the VisualTimeAnomaly benchmark to evaluate multimodal large language models (MLLMs) on time series anomaly detection by converting numerical data into images, revealing that MLLMs effectively detect range- and variate-wise anomalies, show robustness to irregular data, and that open-source models perform comparably to proprietary ones in univariate cases.
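
The benchmark's core preprocessing step, rendering a numeric series as a chart an MLLM can inspect, can be sketched in a few lines. The figure size, the injected range anomaly, and the file name below are illustrative choices, not the paper's exact pipeline.

```python
# Sketch: convert a numeric series into an image for an MLLM to inspect.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
series = np.sin(np.linspace(0, 8 * np.pi, 400)) + 0.1 * rng.standard_normal(400)
series[180:200] += 2.5  # inject a range-wise anomaly (illustrative)

fig, ax = plt.subplots(figsize=(6, 2.5), dpi=150)
ax.plot(series, linewidth=1)
ax.set_xlabel("time step")
ax.set_ylabel("value")
fig.tight_layout()
fig.savefig("series.png")  # image then paired with a text prompt for the MLLM
```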

Authors:Ruxiao Chen, Chenguang Wang, Yuran Sun, Xilei Zhao, Susu Xu
Title: From Perceptions to Decisions: Wildfire Evacuation Decision Prediction with Behavioral Theory-informed LLMs
Abstract:
Evacuation decision prediction is critical for efficient and effective wildfire response by helping emergency management anticipate traffic congestion and bottlenecks, allocate resources, and minimize negative impacts. Traditional statistical methods for evacuation decision prediction fail to capture the complex and diverse behavioral logic of different individuals. In this work, for the first time, we introduce FLARE, short for facilitating LLM for advanced reasoning on wildfire evacuation decision prediction, a Large Language Model (LLM)-based framework that integrates behavioral theories and models to streamline the Chain-of-Thought (CoT) reasoning and subsequently integrate with memory-based Reinforcement Learning (RL) module to provide accurate evacuation decision prediction and understanding. Our proposed method addresses the limitations of using existing LLMs for evacuation behavioral predictions, such as limited survey data, mismatching with behavioral theory, conflicting individual preferences, implicit and complex mental states, and intractable mental state-behavior mapping. Experiments on three post-wildfire survey datasets show an average of 20.47% performance improvement over traditional theory-informed behavioral models, with strong cross-event generalizability. Our complete code is publicly available at https://github.com/SusuXu-s-Lab/FLARE
Chinese: FLARE框架通过结合行为理论与大语言模型推理及强化学习,改进了野火疏散决策预测,相比传统模型性能提升20.47%,有效解决了数据不足和理论不匹配等局限性。
English: The FLARE framework enhances wildfire evacuation decision prediction by integrating behavioral theories with LLM-based reasoning and reinforcement learning, achieving a 20.47% performance improvement over traditional models while addressing data and theory limitations.

Authors:Dang Nguyen, Zeman Li, Mohammadhossein Bateni, Vahab Mirrokni, Meisam Razaviyayn, Baharan Mirzasoleiman
Title: Synthetic Text Generation for Training Large Language Models via Gradient Matching
Abstract:
Synthetic data has the potential to improve the performance, training efficiency, and privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristics and cannot generate human-readable text without compromising the privacy of real data, or provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage Alternating Direction Method of Multipliers (ADMM) that iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data, and maps them to a sequence of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves their privacy. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at https://github.com/BigML-CS-UCLA/GRADMM.
中文: 本研究提出了一种基于ADMM的理论严谨方法,可生成人类可读的合成文本,在保证收敛性、性能和隐私的前提下用于微调大语言模型,并在多项分类任务中得到验证。
English: This study introduces a theoretically rigorous method using ADMM to generate human-readable synthetic text that ensures convergence, performance, and privacy for fine-tuning LLMs, validated across multiple classification tasks.
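
To make the gradient-matching idea concrete, here is a deliberately tiny sketch on a linear classifier: synthetic inputs are optimized so that their loss gradient matches a target gradient computed from real data. Plain Adam stands in for the paper's ADMM procedure, and the privacy noise and mapping to low-perplexity text tokens are omitted.

```python
# Sketch: optimize synthetic inputs so their gradient matches a target
# gradient (toy linear model; not the paper's ADMM pipeline).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 2)

# Target gradient from real data.
x_real = torch.randn(32, 16)
y_real = torch.randint(0, 2, (32,))
loss_real = F.cross_entropy(model(x_real), y_real)
g_real = torch.autograd.grad(loss_real, model.weight)[0].detach()

# Synthetic inputs are free parameters.
x_syn = torch.randn(8, 16, requires_grad=True)
y_syn = torch.randint(0, 2, (8,))
opt = torch.optim.Adam([x_syn], lr=0.1)

for step in range(200):
    opt.zero_grad()
    loss_syn = F.cross_entropy(model(x_syn), y_syn)
    g_syn = torch.autograd.grad(loss_syn, model.weight, create_graph=True)[0]
    match = (g_syn - g_real).pow(2).sum()  # gradient-matching objective
    match.backward()
    opt.step()
print(f"final gradient mismatch: {match.item():.4f}")
```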

Authors:Zhongwei Wan, Hui Shen, Xin Wang, Che Liu, Zheda Mai, Mi Zhang
Title: MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference
Abstract:
Long-context Multimodal Large Language Models (MLLMs) that incorporate long text-image and text-video modalities demand substantial resources, as their multimodal Key-Value (KV) caches grow with increasing input lengths, challenging inference efficiency. Existing methods for KV cache compression, in both text-only and multimodal LLMs, have neglected attention density variations across layers, thus often adopting uniform or progressive reduction strategies for layer-wise cache allocation. In this work, we propose MEDA, a dynamic layer-wise KV cache allocation method for efficient multimodal long-context inference. At its core, MEDA utilizes cross-modal attention entropy to determine the KV cache size at each MLLM layer. Given the dynamically allocated KV cache size at each layer, MEDA also employs a KV pair selection scheme to identify which KV pairs to select and a KV pair merging strategy that merges the selected and non-selected ones to preserve information from the entire context. MEDA achieves up to 72% KV cache memory reduction and 2.82 times faster decoding speed, while maintaining or enhancing performance on various multimodal tasks in long-context settings, including multi-image and long-video scenarios. Our code is released at https://github.com/AIoT-MLSys-Lab/MEDA.
中文: MEDA提出了一种基于跨模态注意力熵的动态分层KV缓存分配方法,在保持多模态长上下文模型性能的同时,显著降低了内存使用并提升了解码速度。
English: MEDA introduces a dynamic layer-wise KV cache allocation method using cross-modal attention entropy to significantly reduce memory usage and accelerate decoding in multimodal long-context models while maintaining performance.
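
One plausible reading of the entropy-driven allocation, sketched below: each layer receives a share of the global KV budget proportional to its cross-modal attention entropy, with a small per-layer floor. The entropy values, budget, and floor are illustrative assumptions, not MEDA's exact rule.

```python
# Sketch: split a global KV cache budget across layers in proportion to
# per-layer attention entropy (illustrative allocation rule).
import numpy as np

def layer_budgets(attn_entropy: np.ndarray, total_kv: int, floor: int = 64):
    weights = attn_entropy / attn_entropy.sum()
    budgets = np.maximum(floor, (weights * total_kv).astype(int))
    # Flooring can slightly exceed the budget; a real allocator would
    # renormalize the remainder.
    return budgets

entropy = np.array([2.1, 3.4, 3.9, 1.2, 0.8, 2.6])  # toy per-layer entropies
print(layer_budgets(entropy, total_kv=4096))
```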

Authors:Simin Chen, Yiming Chen, Zexin Li, Yifan Jiang, Zhongwei Wan, Yixin He, Dezhi Ran, Tianle Gu, Haizhou Li, Tao Xie, Baishakhi Ray
Title: Recent Advances in Large Language Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Abstract:
Data contamination has received increasing attention in the era of large language models (LLMs) due to their reliance on vast Internet-derived training corpora. To mitigate the risk of potential data contamination, LLM benchmarking has undergone a transformation from static to dynamic benchmarking. In this work, we conduct an in-depth analysis of existing static-to-dynamic benchmarking methods aimed at reducing data contamination risks. We first examine methods that enhance static benchmarks and identify their inherent limitations. We then highlight a critical gap: the lack of standardized criteria for evaluating dynamic benchmarks. Based on this observation, we propose a series of optimal design principles for dynamic benchmarking and analyze the limitations of existing dynamic benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.
中文: 本综述分析了为应对数据污染风险从静态基准测试向动态基准测试的转变,指出了现有评估标准的不足,并为动态基准测试提出了优化设计原则。
English: This survey analyzes the shift from static to dynamic benchmarking in large language models to address data contamination risks, identifies gaps in current evaluation standards, and proposes optimal design principles for dynamic benchmarks.

Authors:Yijia Xiao, Wanjia Zhao, Junkai Zhang, Yiqiao Jin, Han Zhang, Zhicheng Ren, Renliang Sun, Haixin Wang, Guancheng Wan, Pan Lu, Xiao Luo, Yu Zhang, James Zou, Yizhou Sun, Wei Wang
Title: Protein Large Language Models: A Comprehensive Survey
Abstract:
Protein-specific large language models (Protein LLMs) are revolutionizing protein science by enabling more efficient protein structure prediction, function annotation, and design. While existing surveys focus on specific aspects or applications, this work provides the first comprehensive overview of Protein LLMs, covering their architectures, training datasets, evaluation metrics, and diverse applications. Through a systematic analysis of over 100 articles, we propose a structured taxonomy of state-of-the-art Protein LLMs, analyze how they leverage large-scale protein sequence data for improved accuracy, and explore their potential in advancing protein engineering and biomedical research. Additionally, we discuss key challenges and future directions, positioning Protein LLMs as essential tools for scientific discovery in protein science. Resources are maintained at https://github.com/Yijia-Xiao/Protein-LLM-Survey.
中文摘要:本研究首次系统综述了蛋白质大语言模型,涵盖其架构、应用与挑战,确立了其在推动蛋白质科学发展的关键工具地位。
English Summary: This work presents the first comprehensive survey of Protein LLMs, detailing their architectures, applications, and challenges while positioning them as essential tools for advancing protein science.

Authors:Xu Wang, Jiaju Kang, Puyu Han, Yubao Zhao, Qian Liu, Liwenfei He, Lingqiong Zhang, Lingyun Dai, Yongcheng Wang, Jie Tao
Title: ECG-Expert-QA: A Benchmark for Evaluating Medical Large Language Models in Heart Disease Diagnosis
Abstract:
We present ECG-Expert-QA, a comprehensive multimodal dataset for evaluating diagnostic capabilities in electrocardiogram (ECG) interpretation. It combines real-world clinical ECG data with systematically generated synthetic cases, covering 12 essential diagnostic tasks and totaling 47,211 expert-validated QA pairs. These encompass diverse clinical scenarios, from basic rhythm recognition to complex diagnoses involving rare conditions and temporal changes. A key innovation is the support for multi-turn dialogues, enabling the development of conversational medical AI systems that emulate clinician-patient or interprofessional interactions. This allows for more realistic assessment of AI models' clinical reasoning, diagnostic accuracy, and knowledge integration. Constructed through a knowledge-guided framework with strict quality control, ECG-Expert-QA ensures linguistic and clinical consistency, making it a high-quality resource for advancing AI-assisted ECG interpretation. It challenges models with tasks like identifying subtle ischemic changes and interpreting complex arrhythmias in context-rich scenarios. To promote research transparency and collaboration, the dataset, accompanying code, and prompts are publicly released at https://github.com/Zaozzz/ECG-Expert-QA
中文: ECG-Expert-QA是一个结合真实与合成心电图案例的多模态数据集,包含47,211个专家验证的问答对,支持多轮对话功能,旨在推进临床推理和诊断准确性的会话式医疗AI系统发展。
English: ECG-Expert-QA is a multimodal dataset combining real and synthetic ECG cases with 47,211 expert-validated QA pairs, featuring multi-turn dialogues to advance conversational medical AI systems for clinical reasoning and diagnostic accuracy.

Authors:Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
Title: MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.
中文摘要:研究发现多模态大语言模型在感知图像细微视觉信息方面存在不足,但其注意力机制即使回答错误时仍能准确定位关键区域;通过利用模型内部的注意力和梯度图,开发出无需训练的可视干预方法,显著提升了模型对微小视觉细节的识别准确率。
English Summary: This study reveals that Multimodal Large Language Models struggle with perceiving small visual details in images, but their attention mechanisms correctly identify relevant areas even when answering incorrectly, leading to the development of training-free intervention methods that significantly improve accuracy by leveraging the models' internal attention and gradient maps.
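
A sketch of the training-free intervention idea: locate the attention peak, crop the image around it, and re-query the model on the zoomed view. The attention map below is synthetic; a real run would read it from the MLLM's attention layers, and the crop fraction is an illustrative choice.

```python
# Sketch: crop an image around the model's attention peak and re-ask the
# question on the zoomed view (synthetic attention map; illustrative only).
import numpy as np

def attention_crop(image: np.ndarray, attn: np.ndarray, frac: float = 0.4):
    """Crop the image around the attention peak; frac sets the crop size."""
    h, w = image.shape[:2]
    ay, ax = np.unravel_index(np.argmax(attn), attn.shape)
    cy, cx = int(ay * h / attn.shape[0]), int(ax * w / attn.shape[1])
    ch, cw = int(h * frac), int(w * frac)
    y0, x0 = max(0, cy - ch // 2), max(0, cx - cw // 2)
    return image[y0:y0 + ch, x0:x0 + cw]

img = np.zeros((448, 448, 3), dtype=np.uint8)
attn = np.random.default_rng(0).random((14, 14))  # stand-in attention map
print(attention_crop(img, attn).shape)  # zoomed view fed back to the MLLM
```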

Authors:Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An
Title: LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
Abstract:
As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this capability. Speculative decoding (SD) offers a promising lossless acceleration technique compared to lossy alternatives such as quantization and model cascades. However, most state-of-the-art SD methods are trained on short texts (typically fewer than 4k tokens), making them unsuitable for long-context scenarios. Specifically, adapting these methods to long contexts presents three key challenges: (1) the excessive memory demands posed by draft models due to large Key-Value (KV) cache; (2) performance degradation resulting from the mismatch between short-context training and long-context inference; and (3) inefficiencies in tree attention mechanisms when managing long token sequences. This work introduces LongSpec, a framework that addresses these challenges through three core innovations: a memory-efficient draft model with a constant-sized KV cache; novel position indices that mitigate the training-inference mismatch; and an attention aggregation strategy that combines fast prefix computation with standard tree attention to enable efficient decoding. Experimental results confirm the effectiveness of LongSpec, achieving up to a 3.26x speedup over strong Flash Attention baselines across five long-context understanding datasets, as well as a 2.25x reduction in wall-clock time on the AIME24 long reasoning task with the QwQ model, demonstrating significant latency improvements for long-context applications. The code is available at https://github.com/sail-sg/LongSpec.
Chinese: LongSpec是一种创新框架,通过内存高效的草稿模型、专用位置索引和优化注意力机制,解决了现有推测解码方法在长上下文场景中的局限性,在长上下文应用中实现了高达3.26倍的加速效果。
English: LongSpec is a novel framework that overcomes the limitations of existing speculative decoding methods in long-context scenarios through memory-efficient draft models, specialized position indices, and optimized attention mechanisms, achieving up to 3.26x speedup in long-context applications.
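
LongSpec's contributions sit on top of standard lossless speculative decoding, whose accept/reject core is sketched below with toy next-token distributions. The draft model's constant-size KV cache, the position indices, and the attention aggregation are not modeled here.

```python
# Sketch: the lossless accept/reject core of speculative decoding.
import numpy as np

rng = np.random.default_rng(0)

def verify(draft_tokens, p_draft, p_target):
    """Accept each drafted token with prob min(1, p_target/p_draft)."""
    accepted = []
    for tok, q, p in zip(draft_tokens, p_draft, p_target):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
        else:
            # Resample from the residual distribution to stay lossless.
            residual = np.maximum(p - q, 0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(p), p=residual)))
            break
    return accepted

V = 8
q = rng.dirichlet(np.ones(V), size=4)  # draft model distributions
p = rng.dirichlet(np.ones(V), size=4)  # target model distributions
draft = [int(rng.choice(V, p=qi)) for qi in q]
print(verify(draft, q, p))
```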

Authors:Yihong Liu, Runsheng Chen, Lea Hirlimann, Ahmad Dawar Hakimi, Mingyang Wang, Amir Hossein Kargaran, Sascha Rothe, François Yvon, Hinrich Schütze
Title: On Relation-Specific Neurons in Large Language Models
Abstract:
In large language models (LLMs), certain neurons can store distinct pieces of knowledge learned during pretraining. While knowledge typically appears as a combination of relations and entities, it remains unclear whether some neurons focus on a relation itself -- independent of any entity. We hypothesize such neurons detect a relation in the input text and guide generation involving such a relation. To investigate this, we study the Llama-2 family on a chosen set of relations with a statistics-based method. Our experiments demonstrate the existence of relation-specific neurons. We measure the effect of selectively deactivating candidate neurons specific to relation $r$ on the LLM's ability to handle (1) facts whose relation is $r$ and (2) facts whose relation is a different relation $r' \neq r$. With respect to their capacity for encoding relation information, we give evidence for the following three properties of relation-specific neurons. $\textbf{(i) Neuron cumulativity.}$ The neurons for $r$ present a cumulative effect so that deactivating a larger portion of them results in the degradation of more facts in $r$. $\textbf{(ii) Neuron versatility.}$ Neurons can be shared across multiple closely related as well as less related relations. Some relation neurons transfer across languages. $\textbf{(iii) Neuron interference.}$ Deactivating neurons specific to one relation can improve LLM generation performance for facts of other relations. We will make our code publicly available at https://github.com/cisnlp/relation-specific-neurons.
中文: 本研究在大语言模型中识别出关系特定神经元,它们能检测文本中的关系并引导生成,通过选择性失活实验揭示了这些神经元的累积性、通用性和干扰性特征。
English: This study identifies relation-specific neurons in large language models that detect relations in text and guide generation, revealing their cumulative, versatile, and interfering properties through selective deactivation experiments.
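
The deactivation experiments can be pictured as a forward hook that zeroes a candidate neuron set, as in this sketch; the tiny MLP, module index, and neuron indices are hypothetical stand-ins for Llama-2's per-layer MLP activations.

```python
# Sketch: selectively deactivate candidate neurons with a forward hook, then
# measure how fact accuracy degrades for relation r vs. other relations.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 256), torch.nn.ReLU(), torch.nn.Linear(256, 64)
)
neurons_to_kill = [3, 17, 42]  # hypothetical relation-specific units

def deactivate(module, inputs, output):
    output[..., neurons_to_kill] = 0.0  # zero the selected activations
    return output

handle = model[1].register_forward_hook(deactivate)
out = model(torch.randn(2, 64))  # run probes, compare accuracy with/without
handle.remove()                  # restore the original model
```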

Authors:Zhenghao Liu, Haolan Wang, Xinze Li, Qiushi Xiong, Xiaocui Yang, Yu Gu, Yukun Yan, Qi Shi, Fangfang Li, Ge Yu, Maosong Sun
Title: HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization
Abstract:
Tabular data contains rich structural semantics and plays a crucial role in organizing and manipulating information. To better capture these structural semantics, this paper introduces the HybrId-modal Preference oPtimizatiOn (HIPPO) model, which represents tables using both text and image, and optimizes MLLMs to effectively learn more comprehensive table information from these multiple modalities. Specifically, HIPPO samples model responses from hybrid-modal table representations and designs a modality-consistent sampling strategy to enhance response diversity and mitigate modality bias during DPO training. Experimental results on table question answering and table fact verification tasks demonstrate the effectiveness of HIPPO, achieving a 4% improvement over various table reasoning models. Further analysis reveals that HIPPO not only enhances reasoning abilities based on unimodal table representations but also facilitates the extraction of crucial and distinct semantics from different modal representations. All data and codes are available at https://github.com/NEUIR/HIPPO.
中文摘要:本文提出的HIPPO模型采用文本与图像混合表示方法优化多模态学习,通过模态一致采样策略提升表格推理能力,在多项任务中实现4%的性能提升。
English Summary: This paper introduces the HIPPO model, which uses hybrid text-image representations to enhance table understanding and achieves a 4% performance improvement on reasoning tasks through modality-consistent optimization.

Authors:Yi-Kai Zhang, De-Chuan Zhan, Han-Jia Ye
Title: Capability Instruction Tuning: A New Paradigm for Dynamic LLM Routing
Abstract:
Large Language Models (LLMs) have demonstrated human-like instruction-following abilities, particularly those exceeding 100 billion parameters. The combined capability of some smaller, resource-friendly LLMs can address most of the instructions that larger LLMs excel at. In this work, we explore how to route each instruction to the best-performing LLM to achieve better overall performance. We develop a new paradigm, constructing capability instructions with model capability representation, user instruction, and performance inquiry prompts to assess the performance. To learn from capability instructions, we introduce a new end-to-end framework called Model Selection with Aptitude Test (Model-SAT), which generates positive and negative samples based on what different models perform well or struggle with. Model-SAT uses a model capability encoder that extends its model representation to a lightweight LLM. Our experiments show that Model-SAT understands the performance dimensions of candidate models and provides the probabilities of their capability to handle various instructions. Additionally, during deployment, a new model can quickly infer its aptitude test results across 50 tasks, each with 20 shots. Model-SAT performs state-of-the-art model routing without candidate inference, including in real-world scenarios where new models are released. The code is available at https://github.com/Now-Join-Us/CIT-LLM-Routing
中文: 超过1000亿参数的大型语言模型展现出类似人类的指令跟随能力,本研究提出Model-SAT框架,通过测试模型能力将指令路由至最佳执行模型而无需候选推理,在现实场景中实现最优性能。
English: Large language models with over 100 billion parameters show human-like instruction-following abilities, and this research introduces Model-SAT, a framework that routes instructions to the best-performing model by testing their capabilities without needing candidate inference, achieving top performance in real-world scenarios.

Authors:Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen
Title: Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Abstract:
We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time speech-based conversation and exhibits outstanding question-answering capabilities, demonstrating its versatility and efficiency. Our code, model and training data are available at https://github.com/baichuan-inc/Baichuan-Audio
中文: Baichuan-Audio 是一款端到端的音频大语言模型,集成了音频理解与生成功能,采用文本引导的语音生成机制和两阶段预训练策略,在实时语音对话和问答中表现卓越。
English: Baichuan-Audio is an end-to-end audio large language model that integrates audio understanding and generation, featuring a text-guided speech generation mechanism and a two-stage pre-training strategy to excel in real-time speech interaction and question-answering.

Authors:Boxuan Zhang, Ruqi Zhang
Title: CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought
Abstract:
Large language models (LLMs) excel in many tasks but struggle to accurately quantify uncertainty in their generated responses. This limitation makes it challenging to detect misinformation and ensure reliable decision-making. Existing uncertainty quantification (UQ) methods for LLMs are primarily prompt-wise rather than response-wise, often requiring multiple response samples, which incurs high computational costs. Moreover, LLMs have been shown to be overconfident, particularly when using reasoning steps to derive their answers. In this work, we propose CoT-UQ, a response-wise UQ framework that integrates LLMs' inherent reasoning capabilities through Chain-of-Thought (CoT) into the UQ process. CoT-UQ captures critical information during inference by extracting keywords from each reasoning step and assessing their importance to the final answer. This key reasoning information is then aggregated to produce a final uncertainty estimate. We conduct extensive experiments based on Llama Family with model sizes varying from 8B to 13B across logical and mathematical reasoning tasks. Experimental results demonstrate that CoT-UQ significantly outperforms existing UQ methods, achieving an average improvement of 5.9% AUROC compared to current UQ methods. The code is available at: https://github.com/ZBox1005/CoT-UQ.
Chinese: 本文提出CoT-UQ框架,通过利用大语言模型的思维链推理能力,从每个推理步骤中提取关键信息并评估其重要性,实现了响应式不确定性量化,在多项任务中以平均5.9%的AUROC提升显著优于现有方法,同时降低了计算成本。
English: This paper introduces CoT-UQ, a response-wise uncertainty quantification framework that leverages LLMs' Chain-of-Thought reasoning to extract and evaluate key information from each step, significantly outperforming existing methods by 5.9% AUROC on average while reducing computational costs.
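
A simplified reading of the aggregation step: weight each reasoning step's keyword confidence by its importance to the final answer and combine into one response-level uncertainty. The importances and token probabilities below are illustrative inputs; a real run would extract them from the model's CoT and logits.

```python
# Sketch: importance-weighted aggregation of per-step keyword confidence
# into one response-level uncertainty score (illustrative rule).
def cot_uncertainty(steps):
    """steps: list of (importance, keyword_token_prob) per reasoning step."""
    num = sum(imp * (1.0 - prob) for imp, prob in steps)
    den = sum(imp for imp, _ in steps)
    return num / den  # weighted uncertainty in [0, 1]

steps = [(0.9, 0.95), (0.4, 0.80), (1.0, 0.55)]  # last step is shaky
print(f"uncertainty: {cot_uncertainty(steps):.3f}")
```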

Authors:Jie Zeng, Qianyu He, Qingyu Ren, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Title: Order Matters: Investigate the Position Bias in Multi-constraint Instruction Following
Abstract:
Real-world instructions with multiple constraints pose a significant challenge to existing large language models (LLMs). An observation is that the LLMs exhibit dramatic performance fluctuation when disturbing the order of the incorporated constraints. Yet, none of the existing works has systematically investigated this position bias problem in the field of multi-constraint instruction following. To bridge this gap, we design a probing task where we quantitatively measure the difficulty distribution of the constraints by a novel Constraint Difficulty Distribution Index (CDDI). Through the experimental results, we find that LLMs are more performant when presented with the constraints in a ``hard-to-easy'' order. This preference can be generalized to LLMs with different architectures or different sizes of parameters. Additionally, we conduct an explanation study, providing an intuitive insight into the correlation between the LLM's attention and constraint orders. Our code and dataset are publicly available at https://github.com/meowpass/PBIF.
Chinese: 研究发现,大型语言模型在约束条件按从难到易的顺序呈现时表现更佳,这种位置偏好适用于不同架构和规模的模型,并通过难度分布指数和注意力分析得到验证。
English: Large language models perform better when constraints are ordered from hardest to easiest, a position bias that persists across different model architectures and sizes, as revealed through a novel difficulty distribution index and attention analysis.
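
Acting on the finding is a one-liner: sort constraints by estimated difficulty before building the prompt, hardest first. The difficulty scores below are illustrative stand-ins for CDDI-style estimates.

```python
# Sketch: reorder constraints "hard-to-easy" before prompting, the order the
# study found models prefer (difficulty scores are illustrative).
constraints = [
    ("use exactly 50 words", 0.9),   # (text, estimated difficulty)
    ("mention Paris", 0.2),
    ("write in passive voice", 0.6),
]
ordered = sorted(constraints, key=lambda c: -c[1])  # hardest first
prompt = "Write a blurb. Constraints:\n" + "\n".join(
    f"{i + 1}. {text}" for i, (text, _) in enumerate(ordered)
)
print(prompt)
```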

Authors:Yuming Yang, Yang Nan, Junjie Ye, Shihan Dou, Xiao Wang, Shuo Li, Huijie Lv, Mingqi Wu, Tao Gui, Qi Zhang, Xuanjing Huang
Title: Measuring Data Diversity for Instruction Tuning: A Systematic Analysis and A Reliable Metric
Abstract:
Data diversity is crucial for the instruction tuning of large language models. Existing studies have explored various diversity-aware data selection methods to construct high-quality datasets and enhance model performance. However, the fundamental problem of precisely defining and measuring data diversity remains underexplored, limiting clear guidance for data engineering. To address this, we systematically analyze 11 existing diversity measurement methods by evaluating their correlation with model performance through extensive fine-tuning experiments. Our results indicate that a reliable diversity measure should properly account for both inter-sample differences and the information density in the sample space. Building on this, we propose NovelSum, a new diversity metric based on sample-level "novelty." Experiments on both simulated and real-world data show that NovelSum accurately captures diversity variations and achieves a 0.97 correlation with instruction-tuned model performance, highlighting its value in guiding data engineering practices. With NovelSum as an optimization objective, we further develop a greedy, diversity-oriented data selection strategy that outperforms existing approaches, validating both the effectiveness and practical significance of our metric. The code is available at https://github.com/UmeanNever/NovelSum.
中文摘要:本研究提出了NovelSum这一新颖的多样性度量方法,通过衡量样本层面的新颖性有效关联模型性能,并通过优于现有方法的数据选择策略验证了其实际应用价值。
English Summary: This study introduces NovelSum, a novel diversity metric that effectively correlates with model performance by measuring sample-level novelty, and demonstrates its practical value through a data selection strategy that outperforms existing methods.
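
The flavor of novelty-driven selection can be sketched with greedy farthest-point sampling in embedding space: each pick maximizes distance to everything chosen so far. NovelSum's density weighting is not reproduced here; this only illustrates "diversity as sample-level novelty".

```python
# Sketch: greedy selection maximizing distance to the already-chosen set
# (farthest-point sampling; a simplification of novelty-based selection).
import numpy as np

def greedy_diverse(embeddings: np.ndarray, k: int) -> list[int]:
    chosen = [0]  # seed with the first sample
    for _ in range(k - 1):
        d = np.linalg.norm(
            embeddings[:, None, :] - embeddings[chosen][None, :, :], axis=-1
        ).min(axis=1)        # distance to nearest already-chosen sample
        d[chosen] = -np.inf  # never re-pick
        chosen.append(int(d.argmax()))
    return chosen

emb = np.random.default_rng(0).random((100, 32))
print(greedy_diverse(emb, k=5))
```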

Authors:Huanghai Liu, Quzhe Huang, Qingjing Chen, Yiran Hu, Jiayu Ma, Yun Liu, Weixing Shen, Yansong Feng
Title: JUREX-4E: Juridical Expert-Annotated Four-Element Knowledge Base for Legal Reasoning
Abstract:
In recent years, Large Language Models (LLMs) have been widely applied to legal tasks. To enhance their understanding of legal texts and improve reasoning accuracy, a promising approach is to incorporate legal theories. One of the most widely adopted theories is the Four-Element Theory (FET), which defines the crime constitution through four elements: Subject, Object, Subjective Aspect, and Objective Aspect. While recent work has explored prompting LLMs to follow FET, our evaluation demonstrates that LLM-generated four-elements are often incomplete and less representative, limiting their effectiveness in legal reasoning. To address these issues, we present JUREX-4E, an expert-annotated four-element knowledge base covering 155 criminal charges. The annotations follow a progressive hierarchical framework grounded in legal source validity and incorporate diverse interpretive methods to ensure precision and authority. We evaluate JUREX-4E on the Similar Charge Disambiguation task and apply it to Legal Case Retrieval. Experimental results validate the high quality of JUREX-4E and its substantial impact on downstream legal tasks, underscoring its potential for advancing legal AI applications. The dataset and code are available at: https://github.com/THUlawtech/JUREX
中文摘要:为提升大语言模型的法律推理能力,JUREX-4E基于四要件理论构建了专家标注知识库,在相似罪名辨析和法律案例检索等任务中展现出显著效果。
English Summary: To improve legal reasoning in Large Language Models, JUREX-4E introduces an expert-annotated knowledge base based on the Four-Element Theory, significantly enhancing performance in legal tasks like charge disambiguation and case retrieval.

Authors:María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico
Title: MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
Abstract:
Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as per the human evaluators. Finally we apply the dataset to our main use-case which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. Our dataset is available at https://github.com/amazon-science/MEMERAG
中文:MEMERAG基准通过原生多语言方法评估检索增强生成系统,利用专家对忠实性和相关性的标注捕捉文化细微差异,为自动评估器提供可靠的多语言性能衡量标准。
English: The MEMERAG benchmark introduces a native multilingual approach to evaluate retrieval augmented generation systems, capturing cultural nuances and enabling reliable assessment of automatic evaluators through expert human annotations of faithfulness and relevance.

Authors:Bruno Puri, Aakriti Jain, Elena Golimblevskaia, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Title: FADE: Why Bad Descriptions Happen to Good Features
Abstract:
Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines in analyzing the latent representations within LLMs. While this may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable model-agnostic framework for automatically evaluating feature-to-description alignment. FADE evaluates alignment across four key metrics - Clarity, Responsiveness, Purity, and Faithfulness - and systematically quantifies the causes of the misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE
中文摘要:本文提出FADE框架,用于评估自动化可解释性流程中特征与描述的匹配度,旨在弥补标准化评估方法的缺失,并揭示生成精确描述所面临的核心挑战。
English Summary: The paper introduces FADE, a scalable framework for evaluating feature-description alignment in automated interpretability pipelines, addressing the lack of standardized evaluation methods and highlighting challenges in generating accurate descriptions.

Authors:Yida Lu, Jiale Cheng, Zhexin Zhang, Shiyao Cui, Cunxiang Wang, Xiaotao Gu, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Title: LongSafety: Evaluating Long-Context Safety of Large Language Models
Abstract:
As Large Language Models (LLMs) continue to advance in understanding and generating long sequences, new safety concerns have been introduced through the long context. However, the safety of LLMs in long-context tasks remains under-explored, leaving a significant gap in both evaluation and improvement of their safety. To address this, we introduce LongSafety, the first comprehensive benchmark specifically designed to evaluate LLM safety in open-ended long-context tasks. LongSafety encompasses 7 categories of safety issues and 6 user-oriented long-context tasks, with a total of 1,543 test cases, averaging 5,424 words per context. Our evaluation towards 16 representative LLMs reveals significant safety vulnerabilities, with most models achieving safety rates below 55%. Our findings also indicate that strong safety performance in short-context scenarios does not necessarily correlate with safety in long-context tasks, emphasizing the unique challenges and urgency of improving long-context safety. Moreover, through extensive analysis, we identify challenging safety issues and task types for long-context models. Furthermore, we find that relevant context and extended input sequences can exacerbate safety risks in long-context scenarios, highlighting the critical need for ongoing attention to long-context safety challenges. Our code and data are available at https://github.com/thu-coai/LongSafety.
中文:LongSafety基准测试揭示了大语言模型在长上下文任务中存在显著安全漏洞,多数模型安全率低于55%,且扩展输入会加剧安全风险。
English: The LongSafety benchmark reveals significant safety vulnerabilities in large language models during long-context tasks, with most models scoring below 55% safety rates and demonstrating that extended inputs can exacerbate risks.

Authors:Md Saidul Hoque Anik, Ariful Azad
Title: SparseTransX: Efficient Training of Translation-Based Knowledge Graph Embeddings Using Sparse Matrix Operations
Abstract:
Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embedding can take a significantly long time, especially for larger datasets. Our analysis shows that the gradient computation of embedding is one of the dominant functions in the translation-based KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (Sparse-Dense Matrix Multiplication) kernels. This allows us to unify multiple scatter (and gather) operations as a single operation, reducing training time and memory usage. We create a general framework for training KG models using sparse kernels and implement four models, namely TransE, TransR, TransH, and TorusE. Our sparse implementations exhibit up to 5.3x speedup on the CPU and up to 4.2x speedup on the GPU with a significantly low GPU memory footprint. The speedups are consistent across large and small datasets for a given model. Our proposed sparse approach can be extended to accelerate other translation-based (such as TransC, TransM, etc.) and non-translational (such as DistMult, ComplEx, RotatE, etc.) models as well. An implementation of the SpTransX framework is publicly available as a Python package in https://github.com/HipGraph/SpTransX.
Chinese: 该研究提出了一种基于稀疏内核的框架,利用SpMM加速知识图谱嵌入训练,在多种模型上实现了CPU最高5.3倍、GPU最高4.2倍的加速效果,同时显著降低了内存占用。
English: The study introduces a sparse kernel-based framework using SpMM to accelerate knowledge graph embedding training, achieving up to 5.3x speedup on CPUs and 4.2x on GPUs while reducing memory usage across various models.
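
The core rewrite can be sketched as follows: the batched embedding gathers in TransE's score ||h + r - t|| are expressed as sparse-dense matrix products with one-hot incidence matrices. Sizes are toys; the real framework fuses much more of the training loop into its SpMM kernels.

```python
# Sketch: TransE's embedding gathers as SpMM with one-hot incidence matrices.
import torch

n_ent, n_rel, dim = 1000, 50, 64
E = torch.randn(n_ent, dim)  # entity embeddings
R = torch.randn(n_rel, dim)  # relation embeddings
triples = torch.tensor([[1, 3, 7], [42, 0, 99]])  # (head, rel, tail) rows

def incidence(indices, n_rows):
    """Sparse one-hot matrix: row i selects embedding indices[i]."""
    b = len(indices)
    ij = torch.stack([torch.arange(b), indices])
    return torch.sparse_coo_tensor(ij, torch.ones(b), (b, n_rows))

h = torch.sparse.mm(incidence(triples[:, 0], n_ent), E)  # one SpMM per role
r = torch.sparse.mm(incidence(triples[:, 1], n_rel), R)
t = torch.sparse.mm(incidence(triples[:, 2], n_ent), E)
score = (h + r - t).norm(p=2, dim=1)  # TransE: ||h + r - t||
print(score)
```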

Authors:Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang
Title: GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
Abstract:
Despite the growing interest in jailbreak methods as an effective red-teaming tool for building safe and responsible large language models (LLMs), flawed evaluation system designs have led to significant discrepancies in their effectiveness assessments. We conduct a systematic measurement study based on 37 jailbreak studies since 2022, focusing on both the methods and the evaluation systems they employ. We find that existing evaluation systems lack case-specific criteria, resulting in misleading conclusions about their effectiveness and safety implications. This paper advocates a shift to a more nuanced, case-by-case evaluation paradigm. We introduce GuidedBench, a novel benchmark comprising a curated harmful question dataset, detailed case-by-case evaluation guidelines and an evaluation system integrated with these guidelines -- GuidedEval. Experiments demonstrate that GuidedBench offers more accurate measurements of jailbreak performance, enabling meaningful comparisons across methods and uncovering new insights overlooked in previous evaluations. GuidedEval reduces inter-evaluator variance by at least 76.03\%. Furthermore, we observe that incorporating guidelines can enhance the effectiveness of jailbreak methods themselves, offering new insights into both attack strategies and evaluation paradigms.
中文摘要:本研究批评了现有大语言模型越狱评估系统的缺陷,提出采用个案化指南的GuidedBench新基准,显著提升评估准确性并降低评估者差异,同时发现指南还能增强越狱方法本身的有效性。
English Summary: This study critiques current jailbreak evaluation systems for LLMs, proposing GuidedBench with case-specific guidelines to improve accuracy and reduce evaluator variance, while also revealing that guidelines can enhance jailbreak effectiveness.

Authors:Himanshu Beniwal, Sailesh Panda, Birudugadda Srivibhav, Mayank Singh
Title: Char-mander Use mBackdoor! A Study of Cross-lingual Backdoor Attacks in Multilingual LLMs
Abstract:
We explore \textbf{C}ross-lingual \textbf{B}ackdoor \textbf{AT}tacks (X-BAT) in multilingual Large Language Models (mLLMs), revealing how backdoors inserted in one language can automatically transfer to others through shared embedding spaces. Using toxicity classification as a case study, we demonstrate that attackers can compromise multilingual systems by poisoning data in a single language, with rare and high-occurring tokens serving as specific, effective triggers. Our findings expose a critical vulnerability that influences the model's architecture, resulting in a concealed backdoor effect during the information flow. Our code and data are publicly available https://github.com/himanshubeniwal/X-BAT.
中文: 本研究提出跨语言后门攻击(X-BAT),通过毒性分类案例证明攻击者仅需污染单一语言数据,即可利用共享嵌入空间使后门在多语言模型中跨语言传播,稀有词汇作为触发器会形成隐蔽的系统漏洞。
English: This study introduces Cross-lingual Backdoor Attacks (X-BAT), demonstrating how backdoors implanted in one language can propagate to others in multilingual models via shared embeddings, using toxicity classification to show how poisoning a single language with rare tokens creates hidden vulnerabilities.

Authors:Zhexin Zhang, Leqi Lei, Junxiao Yang, Xijie Huang, Yida Lu, Shiyao Cui, Renmiao Chen, Qinglin Zhang, Xinyuan Wang, Hao Wang, Hao Li, Xianqi Lei, Chengwei Pan, Lei Sha, Hongning Wang, Minlie Huang
Title: AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement
Abstract:
As AI models are increasingly deployed across diverse real-world scenarios, ensuring their safety remains a critical yet underexplored challenge. While substantial efforts have been made to evaluate and enhance AI safety, the lack of a standardized framework and comprehensive toolkit poses significant obstacles to systematic research and practical adoption. To bridge this gap, we introduce AISafetyLab, a unified framework and toolkit that integrates representative attack, defense, and evaluation methodologies for AI safety. AISafetyLab features an intuitive interface that enables developers to seamlessly apply various techniques while maintaining a well-structured and extensible codebase for future advancements. Additionally, we conduct empirical studies on Vicuna, analyzing different attack and defense strategies to provide valuable insights into their comparative effectiveness. To facilitate ongoing research and development in AI safety, AISafetyLab is publicly available at https://github.com/thu-coai/AISafetyLab, and we are committed to its continuous maintenance and improvement.
中文: 针对AI安全领域缺乏标准化框架的问题,我们推出了AISafetyLab这一集成攻击、防御和评估方法的统一工具包,其具备直观界面和基于Vicuna的实证研究,并已开源以支持持续研究。
English: AISafetyLab is introduced as a unified framework and toolkit to address the lack of standardization in AI safety by integrating attack, defense, and evaluation methods, featuring an intuitive interface and empirical studies on Vicuna, with public availability for ongoing research.

Authors:Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao
Title: LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
Abstract:
Fine-tuning pre-trained Large Language Models (LLMs) for specialized tasks incurs substantial computational and data costs. While model merging offers a training-free solution to integrate multiple task-specific models, existing methods suffer from safety-utility conflicts where enhanced general capabilities degrade safety safeguards. We identify two root causes: $\textbf{neuron misidentification}$ due to simplistic parameter magnitude-based selection, and $\textbf{cross-task neuron interference}$ during merging. To address these challenges, we propose $\textbf{LED-Merging}$, a three-stage framework that $\textbf{L}$ocates task-specific neurons via gradient-based attribution, dynamically $\textbf{E}$lects critical neurons through multi-model importance fusion, and $\textbf{D}$isjoints conflicting updates through parameter isolation. Extensive experiments on Llama-3-8B, Mistral-7B, and Llama2-13B demonstrate that LED-Merging effectively reduces harmful response rates, showing a 31.4\% decrease on Llama-3-8B-Instruct on HarmBench, while simultaneously preserving 95\% of utility performance, such as achieving 52.39\% accuracy on GSM8K. LED-Merging resolves safety-utility conflicts and provides a lightweight, training-free paradigm for constructing reliable multi-task LLMs. Code is available at $\href{https://github.com/MqLeet/LED-Merging}{GitHub}$.
Chinese: LED-Merging是一种无需训练的框架,通过精确定位和隔离任务特定神经元解决模型合并中的安全-效用冲突,在减少31.4%有害响应的同时保持95%的效用性能。
English: LED-Merging is a training-free framework that addresses safety-utility conflicts in model merging by precisely identifying and isolating task-specific neurons, reducing harmful responses by 31.4% while maintaining 95% of utility performance.
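The following is an illustrative reduction of the Locate-Elect-Disjoint idea on a single weight tensor; scoring rows by |delta * grad| is an assumed stand-in for the paper's gradient-based attribution, and the priority loop is a toy for how disjoint row claims prevent cross-task interference.

```python
import torch

def led_merge(base, task_weights, task_grads, keep=0.3):
    deltas = [w - base for w in task_weights]
    # Locate: per-neuron (row) importance from gradient-weighted deltas.
    scores = [(d * g).abs().sum(dim=1) for d, g in zip(deltas, task_grads)]
    # Elect: normalize so importance is comparable across task models.
    fused = [s / (s.sum() + 1e-8) for s in scores]
    k = max(1, int(keep * base.shape[0]))
    merged = base.clone()
    taken = torch.zeros(base.shape[0], dtype=torch.bool)
    # Disjoint: earlier (higher-priority, e.g. safety) tasks claim rows first,
    # so later tasks cannot overwrite them.
    for d, f in zip(deltas, fused):
        rows = torch.argsort(f, descending=True)
        picked = [r for r in rows.tolist() if not taken[r]][:k]
        idx = torch.tensor(picked)
        merged[idx] += d[idx]
        taken[idx] = True
    return merged

base = torch.zeros(8, 4)
tasks = [base + torch.randn(8, 4) for _ in range(2)]
grads = [torch.randn(8, 4) for _ in range(2)]
print(led_merge(base, tasks, grads).shape)
```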

Authors:Joseph Suh, Erfan Jahanparast, Suhong Moon, Minwoo Kang, Serina Chang
Title: Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions
Abstract:
Large language models (LLMs) present novel opportunities in public opinion research by predicting survey responses in advance during the early stages of survey design. Prior methods steer LLMs via descriptions of subpopulations as LLMs' input prompt, yet such prompt engineering approaches have struggled to faithfully predict the distribution of survey responses from human subjects. In this work, we propose directly fine-tuning LLMs to predict response distributions by leveraging unique structural characteristics of survey data. To enable fine-tuning, we curate SubPOP, a significantly scaled dataset of 3,362 questions and 70K subpopulation-response pairs from well-established public opinion surveys. We show that fine-tuning on SubPOP greatly improves the match between LLM predictions and human responses across various subpopulations, reducing the LLM-human gap by up to 46% compared to baselines, and achieves strong generalization to unseen surveys and subpopulations. Our findings highlight the potential of survey-based fine-tuning to improve opinion prediction for diverse, real-world subpopulations and therefore enable more efficient survey designs. Our code is available at https://github.com/JosephJeesungSuh/subpop.
Chinese: 通过在精心构建的SubPOP数据集上对大语言模型进行微调,显著提高了预测人类调查回答分布的准确性,将模型预测与真实人类回答之间的差距缩小了高达46%,从而实现更高效的调查设计。
English: Fine-tuning large language models on the curated SubPOP dataset significantly improves the accuracy of predicting human survey response distributions, reducing the gap between model predictions and actual human responses by up to 46% and enabling more efficient survey design.
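A minimal sketch of the core training signal, under the simplifying assumption that each answer option is scored by a single logit: the model's categorical distribution over options is pushed toward the observed human response distribution with a KL loss, rather than toward a single gold answer.

```python
import torch
import torch.nn.functional as F

def distribution_loss(option_logits, human_dist):
    """option_logits: (num_options,) model logits, one per answer option.
    human_dist: (num_options,) observed response frequencies (sums to 1)."""
    log_pred = F.log_softmax(option_logits, dim=-1)
    return F.kl_div(log_pred, human_dist, reduction="sum")

logits = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)
human = torch.tensor([0.6, 0.3, 0.1])  # e.g. 60% / 30% / 10% of a subpopulation
loss = distribution_loss(logits, human)
loss.backward()
print(float(loss))
```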

Authors:Vladimir Makharev, Vladimir Ivanov
Title: Code Summarization Beyond Function Level
Abstract:
Code summarization is a critical task in natural language processing and software engineering, which aims to generate concise descriptions of source code. Recent advancements have improved the quality of these summaries, enhancing code readability and maintainability. However, the content of a repository or a class has not been considered in function code summarization. This study investigated the effectiveness of code summarization models beyond the function level, exploring the impact of class and repository contexts on summary quality. The study involved revising benchmarks for evaluating models at class and repository levels, assessing baseline models, and evaluating LLMs with in-context learning to determine the enhancement of summary quality with additional context. The findings revealed that the fine-tuned state-of-the-art CodeT5+ base model excelled in code summarization, while incorporating few-shot learning and retrieved code chunks from RAG significantly enhanced the performance of LLMs in this task. Notably, the Deepseek Coder 1.3B and Starcoder2 15B models demonstrated substantial improvements in metrics such as BLEURT, METEOR, and BLEU-4 at both class and repository levels. Repository-level summarization exhibited promising potential but demands significant computational resources and benefits from the inclusion of structured context. Lastly, we employed the recent SIDE code summarization metric in our evaluation. This study contributes to refining strategies for prompt engineering, few-shot learning, and RAG, addressing gaps in benchmarks for code summarization at various levels. Finally, we publish all study details, code, datasets, and results of evaluation in the GitHub repository available at https://github.com/kilimanj4r0/code-summarization-beyond-function-level.
中文: 本研究探索了超越函数级别的代码摘要,发现融入类和仓库上下文可显著提升摘要质量,其中微调模型与检索增强生成技术展现出明显优势。
English: This study explores code summarization beyond the function level, revealing that incorporating class and repository contexts significantly enhances summary quality, with fine-tuned models and retrieval-augmented generation showing notable improvements.

Authors:Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen
Title: CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale
Abstract:
Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs. This limitation, stemming from static pre-training datasets, often results in non-executable code or implementations with suboptimal safety and efficiency. To this end, this paper introduces CODESYNC, a data engine for identifying outdated code patterns and collecting real-time code knowledge updates from Python third-party libraries. Building upon CODESYNC, we develop CODESYNCBENCH, a comprehensive benchmark for assessing LLMs' ability to stay synchronized with code evolution, which covers real-world updates for 220 APIs from six Python libraries. Our benchmark offers 3,300 test cases across three evaluation tasks and an update-aware instruction tuning dataset consisting of 2,200 training samples. Extensive experiments on 14 state-of-the-art LLMs reveal that they struggle with dynamic code evolution, even with the support of advanced knowledge updating methods (e.g., DPO, ORPO, and SimPO). We believe that our benchmark can offer a strong foundation for the development of more effective methods for real-time code knowledge updating in the future. The experimental code and dataset are publicly available at: https://github.com/Lucky-voyage/Code-Sync.
中文摘要:本文提出CODESYNC数据引擎和CODESYNCBENCH基准测试,旨在解决大语言模型在适应持续演变的代码知识方面的不足,发现即使采用先进更新方法,现有模型仍难以应对动态API变更。
English Summary: This paper introduces CODESYNC and CODESYNCBENCH to address LLMs' limitations in adapting to evolving code knowledge, revealing that even advanced models struggle with dynamic API updates despite comprehensive benchmarking and training datasets.

Authors:Zengqing Wu, Takayuki Ito
Title: The Hidden Strength of Disagreement: Unraveling the Consensus-Diversity Tradeoff in Adaptive Multi-Agent Systems
Abstract:
Consensus formation is pivotal in multi-agent systems (MAS), balancing collective coherence with individual diversity. Conventional LLM-based MAS primarily rely on explicit coordination, e.g., prompts or voting, risking premature homogenization. We argue that implicit consensus, where agents exchange information yet independently form decisions via in-context learning, can be more effective in dynamic environments that require long-horizon adaptability. By retaining partial diversity, systems can better explore novel strategies and cope with external shocks. We formalize a consensus-diversity tradeoff, showing conditions where implicit methods outperform explicit ones. Experiments on three scenarios -- Dynamic Disaster Response, Information Spread and Manipulation, and Dynamic Public-Goods Provision -- confirm partial deviation from group norms boosts exploration, robustness, and performance. We highlight emergent coordination via in-context learning, underscoring the value of preserving diversity for resilient decision-making.
中文: 多智能体系统中的隐性共识通过情境学习让智能体交换信息但独立决策,在动态环境中优于显性方法,因其保留多样性从而提升探索能力、鲁棒性和适应性。
English: Implicit consensus in multi-agent systems, where agents exchange information but make independent decisions through in-context learning, preserves diversity and outperforms explicit methods in dynamic environments by enhancing exploration, robustness, and adaptability.

Authors:Jen-Tse Huang, Dasen Dai, Jen-Yuan Huang, Youliang Yuan, Xiaoyuan Liu, Wenxuan Wang, Wenxiang Jiao, Pinjia He, Zhaopeng Tu, Haodong Duan
Title: Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
Abstract:
Despite significant progress on popular multimodal benchmarks, state-of-the-art Multimodal Large Language Models (MLLMs) continue to struggle with basic visual reasoning tasks that are trivially solved by humans, such as recognizing spatial relationships. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. These subtests span four core domains of human visual cognition: (1) Visualization and Spatial Processing, (2) Perceptual and Closure, (3) Memory, and (4) Reasoning. We evaluate 20 frontier MLLMs from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that current MLLM performance gains on high-level benchmarks do not reflect human-like low-level visual cognition, challenging the assumption that large-scale pretraining naturally induces gestalt-like perceptual capabilities. The dataset and evaluation toolkit are publicly available at: https://github.com/CUHK-ARISE/VisFactor.
Chinese: VisFactor基准测试显示,当前多模态大语言模型在人类基础视觉推理任务上表现不佳,评估了20个模型在四个核心视觉认知领域的表现,最高得分仅为25.19分(满分100分),远低于人类水平。
English: Current Multimodal Large Language Models perform poorly on basic visual reasoning tasks compared to humans, as revealed by the VisFactor benchmark, which evaluates 20 models across four core visual cognition domains and finds the highest score to be only 25.19 out of 100.

Authors:Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, Xiang Bai
Title: OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
Abstract:
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that unifies VsTP typical tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a unified framework. Central to our approach is the proposed Structured-Points-of-Thought (SPOT) prompting schemas, which improves model performance across diverse scenarios by leveraging a unified encoder-decoder architecture, objective, and input\&output representation. SPOT eliminates the need for task-specific architectures and loss functions, significantly simplifying the processing pipeline. Our extensive evaluations across four tasks on eight different datasets show that OmniParser V2 achieves state-of-the-art or competitive results in VsTP. Additionally, we explore the integration of SPOT within a multimodal large language model structure, further enhancing text localization and recognition capabilities, thereby confirming the generality of SPOT prompting technique. The code is available at \href{https://github.com/AlibabaResearch/AdvancedLiterateMachinery}{AdvancedLiterateMachinery}.
中文摘要:OmniParser V2通过提出的结构化思维点(SPOT)提示模式,将视觉文本解析中的多个任务统一到单一框架中,简化了处理流程并在多个数据集上取得了领先或竞争性的性能表现。
English Summary: OmniParser V2 introduces a unified framework using Structured-Points-of-Thought (SPOT) prompting to simplify visually-situated text parsing by integrating multiple tasks into a single model, achieving state-of-the-art performance across various datasets.

Authors:William Rudman, Michal Golovanevsky, Amir Bar, Vedant Palit, Yann LeCun, Carsten Eickhoff, Ritambhara Singh
Title: Forgotten Polygons: Multimodal Large Language Models are Shape-Blind
Abstract:
Despite strong performance on vision-language tasks, Multimodal Large Language Models (MLLMs) struggle with mathematical problem-solving, with both open-source and state-of-the-art models falling short of human performance on visual-math benchmarks. To systematically examine visual-mathematical reasoning in MLLMs, we (1) evaluate their understanding of geometric primitives, (2) test multi-step reasoning, and (3) explore a potential solution to improve visual reasoning capabilities. Our findings reveal fundamental shortcomings in shape recognition, with top models achieving under 50% accuracy in identifying regular polygons. We analyze these failures through the lens of dual-process theory and show that MLLMs rely on System 1 (intuitive, memorized associations) rather than System 2 (deliberate reasoning). Consequently, MLLMs fail to count the sides of both familiar and novel shapes, suggesting they have neither learned the concept of sides nor effectively process visual inputs. Finally, we propose Visually Cued Chain-of-Thought (VC-CoT) prompting, which enhances multi-step mathematical reasoning by explicitly referencing visual annotations in diagrams, boosting GPT-4o's accuracy on an irregular polygon side-counting task from 7% to 93%. Our findings suggest that System 2 reasoning in MLLMs remains an open problem, and visually-guided prompting is essential for successfully engaging visual reasoning. Code available at: https://github.com/rsinghlab/Shape-Blind.
Chinese: 多模态大语言模型在视觉数学推理上存在显著缺陷,依赖直觉联想而非系统分析,但通过视觉引导的提示方法(如VC-CoT)可大幅提升其表现。
English: Multimodal Large Language Models exhibit significant deficiencies in visual-mathematical reasoning, relying on intuitive associations rather than deliberate analysis, but their performance can be dramatically improved through visually-guided prompting techniques like VC-CoT.

Authors:Aryan Jadon, Avinash Patil, Shashank Kumar
Title: Enhancing Domain-Specific Retrieval-Augmented Generation: Synthetic Data Generation and Evaluation using Reasoning Models
Abstract:
Retrieval-Augmented Generation (RAG) systems face significant performance gaps when applied to technical domains requiring precise information extraction from complex documents. Current evaluation methodologies relying on document-level metrics inadequately capture the token-resolution retrieval accuracy that is critical for domain-related documents. We propose a framework combining granular evaluation metrics with synthetic data generation to optimize domain-specific RAG performance. First, we introduce token-aware metrics, Precision $\Omega$ and Intersection-over-Union (IoU), that quantify the context-preservation versus information-density trade-offs inherent in technical texts. Second, we develop a reasoning model-driven pipeline using instruction-tuned LLMs (DeepSeek-R1, DeepSeek-R1 distilled variants, and Phi-4) to generate context-anchored QA pairs with discontinuous reference spans across three specialized corpora: SEC 10-K filings (finance), biomedical abstracts (PubMed), and APT threat reports (cybersecurity). Our empirical analysis reveals critical insights: smaller chunks (fewer than 10 tokens) improve precision by 31-42% (IoU = 0.071 vs. baseline 0.053) at a recall cost (-18%), while domain-specific embedding strategies yield 22% variance in optimal chunk sizing (5-20 tokens). The DeepSeek-R1-Distill-Qwen-32B model demonstrates superior concept alignment (+14% mean IoU over alternatives), though no configuration universally dominates. Financial texts favor larger chunks for risk factor coverage (Recall = 0.81 at size = 20), whereas cybersecurity content benefits from atomic segmentation (Precision $\Omega$ = 0.28 at size = 5). Our code is available on https://github.com/aryan-jadon/Synthetic-Data-Generation-and-Evaluation-using-Reasoning-Model
中文摘要:本研究提出了一种结合细粒度评估指标与合成数据生成的框架,以提升检索增强生成(RAG)在技术领域的性能,发现金融、生物医学和网络安全等专业文献的最佳文本块大小与嵌入策略存在显著差异。
English Summary: This study introduces a framework combining token-level evaluation metrics and synthetic data generation to enhance Retrieval-Augmented Generation (RAG) performance in technical domains, revealing that optimal chunk sizes and embedding strategies vary significantly across specialized corpora like finance, biomedical, and cybersecurity documents.
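As a small illustration of token-resolution scoring, the sketch below computes IoU between a retrieved chunk's token span and the reference answer span; representing spans as sets of token positions is a simplification of the paper's Precision $\Omega$ and IoU definitions, but it shows why larger chunks trade precision for recall.

```python
def token_iou(retrieved_tokens, reference_tokens):
    """Both arguments are sets of token positions in the source document."""
    inter = len(retrieved_tokens & reference_tokens)
    union = len(retrieved_tokens | reference_tokens)
    return inter / union if union else 0.0

reference = set(range(100, 110))        # gold span: tokens 100..109
retrieved = set(range(95, 115))         # a 20-token chunk around it
print(token_iou(retrieved, reference))  # 10 / 20 = 0.5
```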

Authors:Haokun Chen, Sebastian Szyller, Weilin Xu, Nageen Himayat
Title: Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
Abstract:
Large language models (LLMs) are trained using massive datasets, which often contain undesirable content such as harmful texts, personal information, and copyrighted material. To address this, machine unlearning aims to remove information from trained models. Recent work has shown that soft token attacks (STA) can successfully extract unlearned information from LLMs, but in this work we show that STAs can be an inadequate tool for auditing unlearning. Using common benchmarks such as Who Is Harry Potter? and TOFU, we demonstrate that in a strong auditor setting such attacks can elicit any information from the LLM, regardless of the deployed unlearning algorithm or whether the queried content was originally present in the training corpus. We further show that STA with just a few soft tokens (1-10) can elicit random strings over 400 characters long, indicating that STAs must be used carefully to effectively audit unlearning. Example code can be found at: https://github.com/IntelLabs/LLMart/tree/main/examples/unlearning
Chinese: 软令牌攻击无法有效审核机器遗忘,因为它能从大语言模型中提取随机或任意信息,与遗忘算法或训练数据是否包含该内容无关。
English: Soft token attacks are unreliable auditors of machine unlearning because they can elicit arbitrary information from large language models, regardless of the unlearning method or whether the queried content ever appeared in the training data.
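A hedged sketch of why STAs are too powerful to be informative: a handful of continuous soft embeddings is optimized so the model emits an arbitrary target string, with nothing tying that string to the training data. GPT-2, the target string, and the hyperparameters are assumptions for illustration, not the paper's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.requires_grad_(False)  # only the soft tokens are optimized
emb = model.get_input_embeddings()

# Arbitrary target: nothing links it to anything the model was trained on.
target_ids = tok("xq7 zz13 rb", return_tensors="pt").input_ids
n_soft, n_tgt = 5, target_ids.shape[1]
soft = torch.randn(1, n_soft, emb.embedding_dim, requires_grad=True)
opt = torch.optim.Adam([soft], lr=0.1)

for _ in range(50):
    inputs = torch.cat([soft, emb(target_ids)], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # Positions n_soft-1 .. n_soft+n_tgt-2 predict target tokens 0 .. n_tgt-1.
    pred = logits[:, n_soft - 1 : n_soft - 1 + n_tgt, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.shape[-1]), target_ids.reshape(-1))
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))  # driven toward zero even for a random string
```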

Authors:Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang
Title: InductionBench: LLMs Fail in the Simplest Complexity Class
Abstract:
Large language models (LLMs) have shown remarkable improvements in reasoning and many existing benchmarks have been addressed by models such as o1 and o3 either fully or partially. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Code and data are available at https://github.com/Wenyueh/inductive_reasoning_benchmark.
中文摘要:大型语言模型在演绎推理方面表现出色,但在归纳推理方面存在明显不足,新基准测试InductionBench显示,即使是最先进的模型也难以从数据中推断出基本规则。
English Summary: Large language models excel in deductive reasoning but struggle with inductive reasoning, as shown by the new benchmark InductionBench, which reveals their difficulty in inferring rules from data despite their advanced capabilities.

Authors:Yanyang Li, Michael Lyu, Liwei Wang
Title: Learning to Reason from Feedback at Test-Time
Abstract:
Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce FTTT, a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.
中文: 大语言模型常需迭代反馈解决复杂任务,本文提出的FTTT范式及其可学习优化器OpTune将反馈利用构建为测试时优化问题,有效克服现有方法的局限,在实验中展现出卓越的扩展性和性能。
English: Large language models often require iterative feedback to solve complex tasks, and the proposed FTTT paradigm with its learnable optimizer OpTune effectively addresses existing limitations by treating feedback utilization as a test-time optimization problem, achieving superior scalability and performance in experiments.

Authors:Zongkai Zhao, Guozeng Xu, Xiuhua Li, Kaiwen Wei, Jiang Zhong
Title: FLEKE: Federated Locate-then-Edit Knowledge Editing
Abstract:
Locate-then-Edit Knowledge Editing (LEKE) is a key technique for updating large language models (LLMs) without full retraining. However, existing methods assume a single-user setting and become inefficient in real-world multi-client scenarios, where decentralized organizations (e.g., hospitals, financial institutions) independently update overlapping knowledge, leading to redundant mediator knowledge vector (MKV) computations and privacy concerns. To address these challenges, we introduce Federated Locate-then-Edit Knowledge Editing (FLEKE), a novel task that enables multiple clients to collaboratively perform LEKE while preserving privacy and reducing computational overhead. To achieve this, we propose FedEdit, a two-stage framework that optimizes MKV selection and reuse. In the first stage, clients locally apply LEKE and upload the computed MKVs. In the second stage, rather than relying solely on server-based MKV sharing, FLEKE allows clients to retrieve relevant MKVs based on cosine similarity, enabling knowledge re-editing and minimizing redundant computations. Experimental results on two benchmark datasets demonstrate that FedEdit retains over 96% of the performance of non-federated LEKE while significantly outperforming a FedAvg-based baseline by approximately twofold. In addition, we find that MEMIT performs more consistently than PMET in the FLEKE task with our FedEdit framework. Our code is available at https://github.com/zongkaiz/FLEKE.
Chinese: FLEKE提出了一种联邦式知识编辑方法,允许多个客户端在保护隐私的同时协作更新大语言模型,并通过优化知识向量复用显著减少冗余计算。
English: FLEKE introduces a federated approach to knowledge editing that enables multiple clients to collaboratively update LLMs while preserving privacy and reducing redundant computations through optimized MKV reuse.
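The second-stage retrieval can be sketched as follows; the key construction is an assumption, since FLEKE's actual MKVs come from locate-then-edit computations, but it shows where the cosine-similarity lookup enters and why redundant MKV computation is avoided.

```python
import torch
import torch.nn.functional as F

def retrieve_mkvs(query_key, shared_keys, shared_mkvs, threshold=0.9):
    """Return cached MKVs whose cosine similarity to the query exceeds the
    threshold; an empty result means the client computes its own MKV."""
    sims = F.cosine_similarity(query_key.unsqueeze(0), shared_keys, dim=1)
    hits = (sims >= threshold).nonzero(as_tuple=True)[0]
    return [shared_mkvs[i] for i in hits.tolist()], sims

keys = F.normalize(torch.randn(10, 64), dim=1)  # keys uploaded by other clients
mkvs = [torch.randn(64) for _ in range(10)]
query = keys[3] + 0.01 * torch.randn(64)        # near-duplicate edit request
found, sims = retrieve_mkvs(query, keys, mkvs)
print(len(found), float(sims.max()))
```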

Authors:Qi Le, Enmao Diao, Ziyan Wang, Xinran Wang, Jie Ding, Li Yang, Ali Anwar
Title: Probe Pruning: Accelerating LLMs through Dynamic Pruning via Model-Probing
Abstract:
We introduce Probe Pruning (PP), a novel framework for online, dynamic, structured pruning of Large Language Models (LLMs) applied in a batch-wise manner. PP leverages the insight that not all samples and tokens contribute equally to the model's output, and probing a small portion of each batch effectively identifies crucial weights, enabling tailored dynamic pruning for different batches. It comprises three main stages: probing, history-informed pruning, and full inference. In the probing stage, PP selects a small yet crucial set of hidden states, based on residual importance, to run a few model layers ahead. During the history-informed pruning stage, PP strategically integrates the probing states with historical states. Subsequently, it structurally prunes weights based on the integrated states and the PP importance score, a metric developed specifically to assess the importance of each weight channel in maintaining performance. In the final stage, full inference is conducted on the remaining weights. A major advantage of PP is its compatibility with existing models, as it operates without requiring additional neural network modules or fine-tuning. Comprehensive evaluations of PP on LLaMA-2/3 and OPT models reveal that even minimal probing, using just 1.5% of FLOPs, can substantially enhance the efficiency of structured pruning of LLMs. For instance, when evaluated on LLaMA-2-7B with WikiText2, PP achieves a 2.56 times lower ratio of performance degradation per unit of runtime reduction compared to the state-of-the-art method at a 40% pruning ratio. Our code is available at https://github.com/Qi-Le1/Probe_Pruning.
中文: 探针剪枝是一种动态批量剪枝框架,通过选择性探测关键标记和权重来优化大语言模型的结构化剪枝,无需额外模块或微调即可显著提升效率。
English: Probe Pruning is a dynamic, batch-wise framework that enhances structured pruning of Large Language Models by selectively probing key tokens and weights, significantly improving efficiency without extra modules or fine-tuning.
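An illustrative, much-reduced version of the idea on a single linear layer: score output channels on a small probe slice of the batch and mask the weakest ones before full inference. The activation-times-weight-norm score is an assumed stand-in for the PP importance score, and the history-integration stage is omitted.

```python
import torch

def probe_and_prune(linear, probe_inputs, keep_ratio=0.6):
    """Score output channels of a Linear layer on a few probe samples and
    zero out the weakest ones for this batch."""
    with torch.no_grad():
        acts = probe_inputs @ linear.weight.T + linear.bias  # (probe, out)
        score = acts.abs().mean(dim=0) * linear.weight.norm(dim=1)
        k = int(keep_ratio * score.numel())
        keep = torch.topk(score, k).indices
        mask = torch.zeros_like(score)
        mask[keep] = 1.0
    return mask  # multiply layer outputs by this mask during full inference

layer = torch.nn.Linear(16, 32)
probe = torch.randn(4, 16)        # a small slice of the batch as the probe
mask = probe_and_prune(layer, probe)
full_batch = torch.randn(64, 16)
out = layer(full_batch) * mask    # structured, batch-wise dynamic pruning
print(int(mask.sum()), out.shape)
```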

Authors:Wenhao Zhu, Pinzhen Chen, Hanxu Hu, Shujian Huang, Fei Yuan, Jiajun Chen, Alexandra Birch
Title: Generalizing From Short to Long: Effective Data Synthesis for Long-Context Instruction Tuning
Abstract:
Long-context modelling for large language models (LLMs) has been a key area of recent research because many real world use cases require reasoning over longer inputs such as documents. The focus of research into modelling long context has been on how to model position and there has been little investigation into other important aspects of language modelling such as instruction tuning. Long context training examples are challenging and expensive to create and use. In this paper, we investigate how to design instruction data for the post-training phase of a long context pre-trained model: how much and what type of context is needed for optimal and efficient post-training. Our controlled study reveals that models instruction-tuned on short contexts can effectively generalize to longer ones, while also identifying other critical factors such as instruction difficulty and context composition. Based on these findings, we propose context synthesis, a novel data synthesis framework that leverages off-the-shelf LLMs to generate extended background contexts for high-quality instruction-answer pairs. Experiment results on the document-level benchmark (LongBench) demonstrate that our proposed approach outperforms previous instruction synthesis approaches and comes close to the performance of human-annotated long-context instruction data. The project will be available at: https://github.com/NJUNLP/context-synthesis.
中文: 研究发现,基于短文本指令微调的大语言模型能有效泛化至长文本处理,并提出通过数据合成框架生成扩展背景语境的新方法,在长文本基准测试中接近人类标注数据的性能。
English: This study finds that instruction-tuning large language models on short contexts enables effective generalization to longer ones and introduces a novel data synthesis framework that generates extended background contexts, achieving near-human performance on long-context benchmarks.

Authors:Jintian Zhang, Yuqi Zhu, Mengshu Sun, Yujie Luo, Shuofei Qiao, Lun Du, Da Zheng, Huajun Chen, Ningyu Zhang
Title: LightThinker: Thinking Step-by-Step Compression
Abstract:
Large language models (LLMs) have shown remarkable performance in complex reasoning tasks, but their efficiency is hindered by the substantial memory and computational costs associated with generating lengthy tokens. In this paper, we propose LightThinker, a novel method that enables LLMs to dynamically compress intermediate thoughts during reasoning. Inspired by human cognitive processes, LightThinker compresses verbose thought steps into compact representations and discards the original reasoning chains, thereby significantly reducing the number of tokens stored in the context window. This is achieved by training the model on when and how to perform compression through data construction, mapping hidden states to condensed gist tokens, and creating specialized attention masks. Additionally, we introduce the Dependency (Dep) metric to quantify the degree of compression by measuring the reliance on historical tokens during generation. Extensive experiments on four datasets and two models show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy. Our work provides a new direction for improving the efficiency of LLMs in complex reasoning tasks without sacrificing performance. Code is released at https://github.com/zjunlp/LightThinker.
中文摘要:LightThinker是一种创新方法,通过将推理过程中的中间思维动态压缩为紧凑表征,在保持准确性的同时显著降低内存使用和推理时间。
English Summary: LightThinker is a novel method that enhances LLM efficiency by dynamically compressing intermediate reasoning steps into compact representations, reducing memory usage and inference time while preserving accuracy.
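A sketch of a dependency-style measurement, assuming access to decoder attention weights: it reports the share of attention mass that generated tokens place on historical context, which should drop when verbose chains are replaced by compact gist tokens. The paper's exact Dep definition may differ; this is only the general shape of the computation.

```python
import torch

def dependency(attn, boundary):
    """attn: (heads, query_len, key_len) attention for generated tokens.
    boundary: key index separating historical context from newer tokens.
    Returns the average attention mass placed on historical tokens."""
    mass_on_history = attn[..., :boundary].sum(dim=-1)  # (heads, query_len)
    return mass_on_history.mean().item()

heads, q_len, k_len = 8, 10, 50
attn = torch.softmax(torch.randn(heads, q_len, k_len), dim=-1)
print(dependency(attn, boundary=40))  # roughly 0.8 for near-uniform attention
```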

Authors:Pengcheng Huang, Zhenghao Liu, Yukun Yan, Haiyan Zhao, Xiaoyuan Yi, Hao Chen, Zhiyuan Liu, Maosong Sun, Tong Xiao, Ge Yu, Chenyan Xiong
Title: ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation
Abstract:
Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. However, they remain susceptible to unfaithful generation, where outputs contradict retrieved context despite its relevance and accuracy. Existing approaches aiming to improve faithfulness primarily focus on enhancing the utilization of external context, but often overlook the persistent influence of internal parametric knowledge during generation. In this work, we investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. Building on this insight, we propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. To evaluate our approach, we introduce CoFaithfulQA, a benchmark specifically designed to evaluate faithfulness in scenarios where internal knowledge conflicts with accurate external evidence. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory. These findings underscore the importance of mitigating internal knowledge dominance and provide a new direction for improving LLM trustworthiness in RAG. All codes are available at https://github.com/OpenBMB/ParamMute.
中文: 本研究提出ParamMute框架,通过抑制大型语言模型中特定前馈网络的激活来降低对内部参数的依赖,从而提升检索证据的忠实度,并在新旧基准测试中取得了显著改进效果。
English: The study introduces ParamMute, a framework that suppresses specific feed-forward networks in large language models to reduce reliance on internal knowledge and enhance faithfulness to retrieved evidence, demonstrating significant improvements on new and existing benchmarks.
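The suppression itself can be prototyped with forward hooks, as in this sketch; GPT-2, the chosen layer indices, and the damping factor are placeholders, whereas the paper identifies unfaithfulness-associated FFNs empirically and additionally calibrates the model toward retrieved knowledge.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
SUPPRESS_LAYERS, ALPHA = [8, 9, 10], 0.3  # assumed layers and damping factor

def damp(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output,
    # muting this FFN's contribution to the residual stream.
    return ALPHA * output

handles = [model.transformer.h[i].mlp.register_forward_hook(damp)
           for i in SUPPRESS_LAYERS]

ids = torch.tensor([[50256, 318, 262]])
with torch.no_grad():
    out = model(ids).logits
for h in handles:
    h.remove()  # restore the original forward behavior
print(out.shape)
```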

Authors:Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li
Title: Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
Abstract:
Training stability is a persistent challenge in the pre-training of large language models (LLMs), particularly for architectures such as Post-Norm Transformers, which are prone to gradient explosion and dissipation. In this paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that stabilizes training by explicitly decoupling the scale and distribution of the weight matrix in fully-connected layers. SDD applies a normalization mechanism to regulate activations and a learnable scaling vector to maintain well-conditioned gradients, effectively preventing $\textbf{gradient explosion and dissipation}$. This separation improves optimization efficiency, particularly in deep networks, by ensuring stable gradient propagation. Experimental results demonstrate that our method stabilizes training across various LLM architectures and outperforms existing techniques in different normalization configurations. Furthermore, the proposed method is lightweight and compatible with existing frameworks, making it a practical solution for stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.
中文: 本文提出的尺度分布解耦方法通过分离全连接层权重矩阵的尺度与分布来稳定大语言模型训练,有效防止梯度爆炸与消散,在不同架构中均表现出优越性能。
English: This paper introduces Scale-Distribution Decoupling (SDD), a lightweight method that stabilizes large language model training by separating weight matrix scale and distribution to prevent gradient issues, showing superior performance across architectures.
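A minimal sketch of a scale-distribution-decoupled linear layer, assuming the decoupling takes the form of unit-normalized weight rows plus a separate learnable per-row scale; the paper's exact placement and normalization details may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SDDLinear(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.scale = nn.Parameter(torch.ones(d_out))  # learnable magnitude
        self.bias = nn.Parameter(torch.zeros(d_out))

    def forward(self, x):
        w = F.normalize(self.weight, dim=1)        # unit-norm rows: distribution
        return self.scale * (x @ w.T) + self.bias  # magnitude handled separately

layer = SDDLinear(64, 32)
print(layer(torch.randn(4, 64)).shape)
```

Because the row norms are fixed at one, gradient magnitude no longer couples to weight magnitude, which is the conditioning property the method exploits.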

Authors:Yuan Sun
Title: Binary-Integer-Programming Based Algorithm for Expert Load Balancing in Mixture-of-Experts Models
Abstract:
For pre-training of MoE (Mixture-of-Experts) models, one of the main issues is unbalanced expert loads, which may cause routing collapse or increased computational overhead. Existing approaches include the Loss-Controlled method and the Loss-Free method, but with both, the degree of imbalance during the first several training steps remains high and decreases only slowly. In this work, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q on each MoE layer that can change the top-K order of the gating scores s by solving a binary integer program at very small time cost. We implement the algorithm on two MoE language models: 16-expert (0.3B) and 64-expert (1.1B). The experimental results show that on both models, compared with the Loss-Controlled and Loss-Free methods, our algorithm trains models with the lowest perplexities while saving at least 13% of pre-training time relative to the Loss-Controlled method. To the best of our knowledge, this is the first routing algorithm that maintains load balance on every expert in every MoE layer from the first step to the last throughout pre-training, while the trained MoE models also perform well. The code for this work is available at https://github.com/sunyuanLLM/bip_routing_algorithm.
中文: 本文提出了一种基于二进制整数规划的专家负载均衡算法,该算法从训练第一步起即可解决MoE模型中专家负载不均衡的问题,在获得最低困惑度的同时,相比现有方法至少节省13%的预训练时间。
English: This paper introduces a BIP-based expert load balancing algorithm that effectively resolves unbalanced expert loads in MoE models from the first training step, achieving the lowest perplexities and reducing pre-training time by at least 13% compared to existing methods.
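The role of q can be illustrated without a BIP solver: the sketch below routes by top-K over s + q and nudges q toward balance with a simple proportional update. This is a deliberate simplification standing in for the paper's per-step binary integer program, shown only to make the mechanism concrete.

```python
import torch

def route(s, q, k=2):
    """s: (tokens, experts) gating scores; q: (experts,) learned offsets.
    Adding q before the top-K lets the offsets reorder expert selection."""
    return torch.topk(s + q, k, dim=1).indices

tokens, experts = 256, 16
s = torch.randn(tokens, experts)
q = torch.zeros(experts)
for step in range(20):
    chosen = route(s, q)
    load = torch.bincount(chosen.flatten(), minlength=experts).float()
    q -= 0.01 * (load - load.mean())  # lower q for overloaded experts
print(load.min().item(), load.max().item())  # spread narrows as q adapts
```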

Authors:Raghav Singhal, Kaustubh Ponkshe, Rohit Vartak, Lav R. Varshney, Praneeth Vepakomma
Title: Fed-SB: A Silver Bullet for Extreme Communication Efficiency and Performance in (Private) Federated LoRA Fine-Tuning
Abstract:
Low-Rank Adaptation (LoRA) has become ubiquitous for efficiently fine-tuning foundation models. However, federated fine-tuning using LoRA is challenging due to suboptimal updates arising from traditional federated averaging of individual adapters. Existing solutions either incur prohibitively high communication cost that scales linearly with the number of clients or suffer from performance degradation due to limited expressivity. We introduce Federated Silver Bullet (Fed-SB), a novel approach for federated fine-tuning of LLMs using LoRA-SB, a recently proposed low-rank adaptation method. LoRA-SB optimally aligns the optimization trajectory with the ideal low-rank full fine-tuning projection by learning a small square matrix (R) between adapters B and A, keeping other components fixed. Direct averaging of R guarantees exact updates, substantially reducing communication cost, which remains independent of the number of clients, and enables scalability. Fed-SB achieves state-of-the-art performance across commonsense reasoning, arithmetic reasoning, and language inference tasks while reducing communication costs by up to 230x. In private settings, Fed-SB further improves performance by (1) reducing trainable parameters, thereby lowering the noise required for differential privacy and (2) avoiding noise amplification introduced by other methods. Overall, Fed-SB establishes a new Pareto frontier in the tradeoff between communication and performance, offering an efficient and scalable solution for both private and non-private federated fine-tuning. Our code is publicly available at https://github.com/CERT-Lab/fed-sb.
中文: Fed-SB 提出了一种基于 LoRA-SB 的高效联邦微调方法,通过学习小型矩阵实现精确更新,通信成本降低高达 230 倍,并在多项推理任务中达到最优性能。
English: Fed-SB introduces an efficient federated fine-tuning method using LoRA-SB, which reduces communication costs by up to 230x and achieves top performance across reasoning tasks by learning a small matrix for exact updates.
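The aggregation step is easy to verify in isolation: because the merged update B R A is linear in R, averaging the clients' small R matrices gives exactly the average of their full updates, as this sketch checks with illustrative shapes. This is why the server traffic stays at r x r per client regardless of client count.

```python
import torch

d, r, clients = 512, 16, 8
B = torch.randn(d, r)   # frozen adapter, shared across clients
A = torch.randn(r, d)   # frozen adapter, shared across clients
client_R = [torch.randn(r, r) for _ in range(clients)]  # locally trained

R_avg = torch.stack(client_R).mean(dim=0)  # the server's exact aggregate
delta_W = B @ R_avg @ A                    # merged low-rank update

# Equivalence check: averaging the full updates gives the same matrix.
delta_check = torch.stack([B @ R @ A for R in client_R]).mean(dim=0)
print(torch.allclose(delta_W, delta_check, atol=1e-5))
```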

Authors:Sanghee Park, Geewook Kim
Title: Evaluating Multimodal Generative AI with Korean Educational Standards
Abstract:
This paper presents the Korean National Educational Test Benchmark (KoNET), a new benchmark designed to evaluate Multimodal Generative AI Systems using Korean national educational tests. KoNET comprises four exams: the Korean Elementary General Educational Development Test (KoEGED), Middle (KoMGED), High (KoHGED), and College Scholastic Ability Test (KoCSAT). These exams are renowned for their rigorous standards and diverse questions, facilitating a comprehensive analysis of AI performance across different educational levels. By focusing on Korean, KoNET provides insights into model performance in less-explored languages. We assess a range of models (open-source, open-access, and closed APIs) by examining difficulties, subject diversity, and human error rates. The code and dataset builder will be made fully open-sourced at https://github.com/naver-ai/KoNET.
中文: KoNET基准通过韩国四个级别的国家级教育考试,全面评估多模态生成式AI系统在较少研究语言中的表现,涵盖多样化的学科和难度。
English: The KoNET benchmark evaluates multimodal generative AI systems using rigorous Korean national educational tests across four levels to analyze performance in less-explored languages and diverse subjects.

Authors:Xuetao Ma, Wenbin Jiang, Hua Huang
Title: Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning
Abstract:
In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be released at https://github.com/maxuetao/CurriculumICL
中文: 本研究提出了一种基于解题逻辑的课程上下文学习策略,通过分析解题步骤选择示例并按难度排序,有效提升了大型语言模型在复杂推理任务中的表现与效率。
English: This study introduces a curriculum in-context learning strategy that selects and orders demonstration examples based on problem-solving logic and difficulty, significantly improving the performance and efficiency of large language models in complex reasoning tasks.
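A toy sketch of the selection-and-ordering step: pick the most relevant annotated demonstrations, then order them easy-to-hard by step count before building the prompt. The word-overlap relevance function is a placeholder for the paper's fine-tuned problem-solving-logic analyzer.

```python
def build_prompt(question, candidates, n_demos=3):
    """candidates: list of dicts with 'question', 'answer', 'steps' (list)."""
    ranked = sorted(candidates, key=lambda c: relevance(question, c),
                    reverse=True)
    # Curriculum ordering: fewer solution steps first (easy to hard).
    chosen = sorted(ranked[:n_demos], key=lambda c: len(c["steps"]))
    demos = "\n\n".join(
        f"Q: {c['question']}\nSteps: {' -> '.join(c['steps'])}\nA: {c['answer']}"
        for c in chosen)
    return f"{demos}\n\nQ: {question}\nA:"

def relevance(question, cand):  # placeholder: shared-word overlap
    return len(set(question.split()) & set(cand["question"].split()))

pool = [{"question": "add 2 and 3", "answer": "5", "steps": ["2+3"]},
        {"question": "add 2 and 3 then double", "answer": "10",
         "steps": ["2+3", "5*2"]}]
print(build_prompt("add 4 and 3 then double", pool, n_demos=2))
```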

Authors:Xuyang Wu, Jinming Nian, Ting-Ruen Wei, Zhiqiang Tao, Hsin-Tai Wu, Yi Fang
Title: Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning
Abstract:
Recent advances in large language models (LLMs) have enabled automatic generation of chain-of-thought (CoT) reasoning, leading to strong performance on tasks such as math and code. However, when reasoning steps reflect social stereotypes (e.g., those related to gender, race or age), they can reinforce harmful associations and lead to misleading conclusions. We present the first systematic evaluation of social bias within LLM-generated reasoning, focusing on reasoning language models (e.g., DeepSeek-R1, OpenAI o1) that natively produce reasoning chains as part of their answers. Using the BBQ dataset, we analyze both prediction accuracy and reasoning bias across a broad spectrum of models, including instruction-tuned and CoT-augmented variants of DeepSeek-R1 (8B/32B), ChatGPT, and other open-source LLMs. We quantify how biased reasoning steps correlate with incorrect predictions and often lead to stereotype expression. To mitigate reasoning-induced bias, we propose Answer Distribution as Bias Proxy (ADBP), a lightweight mitigation method that detects bias by tracking how model predictions change across incremental reasoning steps. ADBP outperforms the Stereotype-free Reasoning Pattern (SfRP) baseline in most cases, mitigating bias and improving the accuracy of LLM outputs. Evaluation and mitigation code is available at https://github.com/elviswxy/LLM_reasoning_bias.
中文: 大型语言模型的最新进展实现了自动思维链推理,但此类推理可能强化有害的社会刻板印象,导致偏见结论;本研究首次系统评估了推理链中的社会偏见,并提出ADBP这一轻量级缓解方法,通过追踪推理步骤中的预测变化来检测偏见,在多数情况下优于基线方法。
English: Recent advances in large language models enable automatic chain-of-thought reasoning, but such reasoning can reinforce harmful social stereotypes, leading to biased conclusions; this study presents the first systematic evaluation of social bias in reasoning chains and proposes ADBP, a lightweight mitigation method that detects bias through prediction changes across reasoning steps, outperforming baseline approaches.
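The mitigation can be sketched as a stability check over incremental reasoning prefixes; `predict` below is a stand-in for reading the model's answer at each truncation point of the chain, and the flagging rule is an assumed simplification of ADBP.

```python
def adbp_flags(reasoning_steps, predict):
    """predict(prefix_steps) -> answer label given a partial reasoning chain.
    Returns (unstable, trace): unstable is True if the predicted answer
    flips as reasoning accumulates, marking a bias suspect."""
    answers = [predict(reasoning_steps[: i + 1])
               for i in range(len(reasoning_steps))]
    return len(set(answers)) > 1, answers

steps = ["the person is young", "young people are tech-savvy",
         "so: cannot tell"]
fake_predict = lambda prefix: "B" if len(prefix) < 3 else "C"
unstable, trace = adbp_flags(steps, fake_predict)
print(unstable, trace)  # True ['B', 'B', 'C']: the answer flipped, inspect it
```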

Authors:Feiyang Chen, Yu Cheng, Lei Wang, Yuqing Xia, Ziming Miao, Lingxiao Ma, Fan Yang, Jilong Xue, Zhi Yang, Mao Yang, Haibo Chen
Title: AttentionEngine: A Versatile Framework for Efficient Attention Mechanisms on Diverse Hardware Platforms
Abstract:
Transformers and large language models (LLMs) have revolutionized machine learning, with attention mechanisms at the core of their success. As the landscape of attention variants expands, so too do the challenges of optimizing their performance, particularly across different hardware platforms. Current optimization strategies are often narrowly focused, requiring extensive manual intervention to accommodate changes in model configurations or hardware environments. In this paper, we introduce AttentionEngine, a comprehensive framework designed to streamline the optimization of attention mechanisms across heterogeneous hardware backends. By decomposing attention computation into modular operations with customizable components, AttentionEngine enables flexible adaptation to diverse algorithmic requirements. The framework further automates kernel optimization through a combination of programmable templates and a robust cross-platform scheduling strategy. Empirical results reveal performance gains of up to 10x on configurations beyond the reach of existing methods. AttentionEngine offers a scalable, efficient foundation for developing and deploying attention mechanisms with minimal manual tuning. Our code has been open-sourced and is available at https://github.com/microsoft/AttentionEngine.
Chinese: AttentionEngine 是一个综合性框架,可在多样化硬件平台上自动优化注意力机制,实现高达10倍的性能提升,同时极大减少了人工调优需求。
English: AttentionEngine is a comprehensive framework that automates and optimizes attention mechanisms across diverse hardware platforms, achieving up to 10x performance improvements with minimal manual intervention.

Authors:Shilong Hou, Ruilin Shang, Zi Long, Xianghua Fu, Yin Chen
Title: A General Pseudonymization Framework for Cloud-Based LLMs: Replacing Privacy Information in Controlled Text Generation
Abstract:
An increasing number of companies have begun providing services that leverage cloud-based large language models (LLMs), such as ChatGPT. However, this development raises substantial privacy concerns, as users' prompts are transmitted to and processed by the model providers. Among the various privacy protection methods for LLMs, those implemented during the pre-training and fine-tuning phases fail to mitigate the privacy risks associated with the remote use of cloud-based LLMs by users. On the other hand, methods applied during the inference phase are primarily effective in scenarios where the LLM's inference does not rely on privacy-sensitive information. In this paper, we outline the process of remote user interaction with LLMs and, for the first time, propose a detailed definition of a general pseudonymization framework applicable to cloud-based LLMs. The experimental results demonstrate that the proposed framework strikes an optimal balance between privacy protection and utility. The code for our method is available to the public at https://github.com/Mebymeby/Pseudonymization-Framework.
中文: 本文针对云端大语言模型提出了一种假名化框架,旨在解决用户交互过程中的隐私风险,并在隐私保护与实用性之间实现了最佳平衡。
English: This paper introduces a pseudonymization framework for cloud-based large language models to address privacy risks during user interactions, achieving an optimal balance between privacy protection and utility.
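A minimal sketch of the replace-then-restore loop, with toy entity detection; the framework's detection and replacement strategies are more general than the literal string substitution used here, but the privacy boundary is the same: the cloud provider only ever sees placeholders.

```python
def pseudonymize(text, entities):
    """Swap each detected private entity for a placeholder alias."""
    mapping = {}
    for i, ent in enumerate(entities):
        alias = f"PERSON_{i}"
        mapping[alias] = ent
        text = text.replace(ent, alias)
    return text, mapping

def restore(text, mapping):
    """Map placeholders in the cloud model's reply back to the originals."""
    for alias, ent in mapping.items():
        text = text.replace(alias, ent)
    return text

prompt, mapping = pseudonymize("Write a bio for Alice Chen.", ["Alice Chen"])
cloud_reply = "PERSON_0 is a researcher."  # stand-in for the remote LLM reply
print(prompt)                      # the provider only sees PERSON_0
print(restore(cloud_reply, mapping))
```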

Authors:Mengqiao Liu, Tevin Wang, Cassandra A. Cohen, Sarah Li, Chenyan Xiong
Title: Understand User Opinions of Large Language Models via LLM-Powered In-the-Moment User Experience Interviews
Abstract:
Which large language model (LLM) is better? Every evaluation tells a story, but what do users really think about current LLMs? This paper presents CLUE, an LLM-powered interviewer that conducts in-the-moment user experience interviews, right after users interact with LLMs, and automatically gathers insights about user opinions from massive interview logs. We conduct a study with thousands of users to understand user opinions on mainstream LLMs, recruiting users to first chat with a target LLM and then be interviewed by CLUE. Our experiments demonstrate that CLUE captures interesting user opinions, e.g., the bipolar views on the displayed reasoning process of DeepSeek-R1 and demands for information freshness and multi-modality. Our code and data are at https://github.com/cxcscmu/LLM-Interviewer.
中文: 本文提出CLUE,一个由大语言模型驱动的访谈系统,能在用户与模型交互后即时进行用户体验访谈,并通过海量访谈数据自动分析用户对主流模型(如DeepSeek-R1)的真实看法。
English: This paper introduces CLUE, an LLM-powered interviewer that conducts real-time user experience interviews after interactions with LLMs, automatically extracting insights from large-scale logs to reveal user opinions on mainstream models like DeepSeek-R1.

Authors:Jinchuan Tian, Jiatong Shi, William Chen, Siddhant Arora, Yoshiki Masuyama, Takashi Maekaku, Yihan Wu, Junyi Peng, Shikhar Bharadwaj, Yiwen Zhao, Samuele Cornell, Yifan Peng, Xiang Yue, Chao-Han Huck Yang, Graham Neubig, Shinji Watanabe
Title: ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Abstract:
We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.
中文: ESPnet-SpeechLM 是一个开源工具包,通过标准化工作流程、可配置模块和包括 17 亿参数模型在内的用例,简化了语音语言模型和语音驱动应用的开发。
English: ESPnet-SpeechLM is an open toolkit that simplifies the development of speech language models and voice-driven applications through standardized workflows, configurable modules, and demonstrated use cases, including a 1.7B-parameter model.

Authors:Jianglin Lu, Yixuan Liu, Yitian Zhang, Yun Fu
Title: Scale-Free Graph-Language Models
Abstract:
Graph-language models (GLMs) have demonstrated great potential in graph-based semi-supervised learning. A typical GLM consists of two key stages: graph generation and text embedding, which are usually implemented by inferring a latent graph and finetuning a language model (LM), respectively. However, the former often relies on artificial assumptions about the underlying edge distribution, while the latter requires extensive data annotations. To tackle these challenges, this paper introduces a novel GLM that integrates graph generation and text embedding within a unified framework. Specifically, for graph generation, we leverage an inherent characteristic of real edge distribution--the scale-free property--as a structural prior. We unexpectedly find that this natural property can be effectively approximated by a simple k-nearest neighbor (KNN) graph. For text embedding, we develop a graph-based pseudo-labeler that utilizes scale-free graphs to provide complementary supervision for improved LM finetuning. Extensive experiments on representative datasets validate our findings on the scale-free structural approximation of KNN graphs and demonstrate the effectiveness of integrating graph generation and text embedding with a real structural prior. Our code is available at https://github.com/Jianglin954/SFGL.
中文: 本文提出了一种统一的图语言模型,利用KNN图近似真实图的尺度无关特性,无需大量标注即可同时改进图生成和文本嵌入。
English: This paper introduces a unified graph-language model that leverages the scale-free property of real graphs, approximated by KNN graphs, to improve both graph generation and text embedding without extensive annotations.
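The structural prior is cheap to experiment with: build a KNN graph over stand-in node embeddings and inspect its degree distribution, which the paper finds approximates the scale-free property of real edge distributions. The symmetrized construction below is an assumption for illustration.

```python
import torch

def knn_graph(x, k=5):
    dist = torch.cdist(x, x)
    dist.fill_diagonal_(float("inf"))           # exclude self-loops
    nbrs = dist.topk(k, largest=False).indices  # k nearest per node
    adj = torch.zeros(x.shape[0], x.shape[0], dtype=torch.bool)
    adj[torch.arange(x.shape[0]).unsqueeze(1), nbrs] = True
    return adj | adj.T                          # symmetrize

emb = torch.randn(200, 32)                      # stand-in text embeddings
adj = knn_graph(emb)
degrees = adj.sum(dim=1)
print(degrees.float().mean().item(), degrees.max().item())
```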

Authors:Xiaoyu Chen, Changde Du, Che Liu, Yizhe Wang, Huiguang He
Title: BP-GPT: Auditory Neural Decoding Using fMRI-prompted LLM
Abstract:
Decoding language information from brain signals represents a vital research area within brain-computer interfaces, particularly in the context of deciphering semantic information from fMRI signals. Although existing work uses LLMs toward this goal, their methods are not end-to-end and leave the LLM out of the fMRI-to-text mapping, leaving room to explore LLMs in auditory decoding. In this paper, we introduce a novel method, the Brain Prompt GPT (BP-GPT). By using the brain representation extracted from the fMRI as a prompt, our method can utilize GPT-2 to decode fMRI signals into stimulus text. Further, we introduce a text prompt and align the fMRI prompt to it. By introducing the text prompt, our BP-GPT can extract a more robust brain prompt and promote decoding by the pre-trained LLM. We evaluate BP-GPT on the open-source auditory semantic decoding dataset and achieve significant improvements of up to 4.61 on METEOR and 2.43 on BERTScore across all subjects compared to the state-of-the-art method. The experimental results demonstrate that using brain representations as prompts to further drive LLMs for auditory neural decoding is feasible and effective. The code is available at https://github.com/1994cxy/BP-GPT.
中文: 本文提出的BP-GPT方法通过将fMRI信号转换为脑提示来驱动GPT-2进行听觉语义解码,相比现有最佳方法实现了显著性能提升。
English: This paper introduces BP-GPT, an end-to-end method that uses fMRI-derived brain prompts to drive GPT-2 for auditory semantic decoding, achieving significant improvements over state-of-the-art methods.
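A hedged sketch of prompting with brain features: project fMRI features into the LLM's embedding space and prepend them as a soft prefix before decoding. The linear head, prefix length, voxel dimension, and GPT-2 target are illustrative assumptions; BP-GPT additionally aligns the fMRI prompt to a text prompt during training.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
dim = model.get_input_embeddings().embedding_dim

fmri = torch.randn(1, 1024)                 # stand-in voxel features
to_prefix = torch.nn.Linear(1024, 8 * dim)  # trainable projection head
prefix = to_prefix(fmri).view(1, 8, dim)    # 8 soft prompt positions

with torch.no_grad():
    out = model.generate(inputs_embeds=prefix, max_new_tokens=10,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0]))  # decoded continuation driven by the brain prompt
```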

Authors:Tianjie Ju, Bowen Wang, Hao Fei, Mong-Li Lee, Wynne Hsu, Yun Li, Qianren Wang, Pengzhou Cheng, Zongru Wu, Zhuosheng Zhang, Gongshen Liu
Title: Investigating the Adaptive Robustness with Knowledge Conflicts in LLM-based Multi-Agent Systems
Abstract:
Recent advances in Large Language Models (LLMs) have upgraded them from sophisticated text generators to autonomous agents capable of cooperation and tool use in multi-agent systems (MASs). However, the robustness of these LLM-based MASs, especially under knowledge conflicts, remains unclear. In this paper, we design four comprehensive metrics to investigate the robustness of MASs when facing mild or task-critical knowledge conflicts. We first analyze mild knowledge conflicts introduced by heterogeneous agents and find that they do not harm system robustness but instead improve collaborative decision-making. Next, we investigate task-critical knowledge conflicts by synthesizing knowledge conflicts and embedding them into one of the agents. Our results show that these conflicts have surprisingly little to no impact on MAS robustness. Furthermore, we observe that MASs demonstrate certain self-repairing capabilities by reducing their reliance on knowledge conflicts and adopting alternative solution paths to maintain stability. Finally, we conduct ablation studies on the knowledge conflict number, agent number, and interaction rounds, finding that the self-repairing capability of MASs has intrinsic limits, and all findings hold consistently across various factors. Our code is publicly available at https://github.com/wbw625/MultiAgentRobustness.
中文: 研究发现,异构智能体引入的轻度知识冲突不仅不会损害系统稳健性,反而能改善协同决策;任务关键型知识冲突对多智能体系统的影响也出奇地小,系统可通过降低对冲突知识的依赖并采用替代解决路径实现一定程度的自我修复,但该能力存在内在上限。
English: This study finds that mild knowledge conflicts introduced by heterogeneous agents do not harm robustness but instead improve collaborative decision-making, and that task-critical conflicts have surprisingly little impact on MAS robustness, as systems self-repair by reducing reliance on conflicting knowledge and adopting alternative solution paths, though this capability has intrinsic limits.

Authors:Yen-Che Hsiao, Abhishek Dutta
Title: Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps
Abstract:
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data, including GPT2, SmolLM2, OpenELM, TinyLlama, Stable LM, and Gemma 2. We identify a critical parameter threshold (~1.6 billion), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning. Specifically, models above this threshold achieve better success rates in chain-of-thought (CoT) prompting for deductive reasoning tasks, especially those requiring longer reasoning chains, such as proof by contradiction and disjunction elimination. To address limitations in sub-threshold models, we demonstrate that fine-tuning with task-specific exemplars substantially enhances reasoning performance, enabling accurate CoT generation even without additional exemplars in the prompt for tasks with shorter reasoning chains. Finally, our analysis of attention maps reveals that models capable of generating correct CoTs exhibit higher token-level attention scores on subsequent correct tokens and the correct parts of speech, providing interpretability insights into reasoning processes. These findings collectively advance understanding of reasoning capabilities in decoder-only transformer-based models. The code is available at: https://github.com/AnnonymousForPapers/CoT_Reasoning_Test.
中文: 本研究发现仅解码器Transformer语言模型需达到约16亿参数的关键阈值才能显著提升推理能力,注意力图分析为思维链生成过程提供了可解释性依据。
English: This study reveals that decoder-only transformer language models require a critical parameter threshold of approximately 1.6 billion to achieve significant reasoning improvements, with attention map analysis providing interpretability for chain-of-thought generation processes.
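
For readers who want to try this kind of attention-map inspection, the sketch below reads token-level attention scores from a small decoder-only model with Hugging Face transformers; the choice of gpt2, the last layer, and head averaging are illustrative, not the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)

ids = tok("All men are mortal. Socrates is a man. Therefore Socrates is",
          return_tensors="pt")
with torch.no_grad():
    att = model(**ids).attentions      # tuple: layer -> (B, heads, T, T)

last = att[-1][0].mean(0)              # last layer, averaged over heads
T = last.shape[0]
# Attention the final position pays to each earlier token:
for tok_id, score in zip(ids.input_ids[0], last[T - 1]):
    print(f"{tok.decode([int(tok_id)])!r:>14} {float(score):.3f}")
```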

Authors:Yeonjun In, Wonjoong Kim, Kanghoon Yoon, Sungchul Kim, Mehrab Tanjim, Kibum Kim, Chanyoung Park
Title: Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
Abstract:
As the use of large language model (LLM) agents continues to grow, their safety vulnerabilities have become increasingly evident. Extensive benchmarks evaluate various aspects of LLM safety by defining safety largely in terms of general standards, overlooking user-specific standards. However, safety standards for LLMs may vary based on user-specific profiles rather than being universally consistent across all users. This raises a critical research question: Do LLM agents act safely when considering user-specific safety standards? Despite its importance for safe LLM use, no benchmark datasets currently exist to evaluate the user-specific safety of LLMs. To address this gap, we introduce U-SAFEBENCH, the first benchmark designed to assess the user-specific aspect of LLM safety. Our evaluation of 18 widely used LLMs reveals that current LLMs fail to act safely when considering user-specific safety standards, marking a new discovery in this field. To address this vulnerability, we propose a simple remedy based on chain-of-thought, demonstrating its effectiveness in improving user-specific safety. Our benchmark and code are available at https://github.com/yeonjun-in/U-SafeBench.
中文: 本研究提出了首个评估大语言模型用户特定安全性的基准U-SAFEBENCH,发现现有模型无法满足个性化安全标准,并通过思维链方法提出了有效的改进方案。
English: The study introduces U-SAFEBENCH, the first benchmark evaluating user-specific safety in LLMs, revealing current models' failure to meet personalized safety standards and proposing an effective chain-of-thought mitigation strategy.

Authors:Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal
Title: UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
Abstract:
User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or "forgetting" a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model's other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model's representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. Across three standard unlearning methods, UPCORE consistently achieves a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. Our results show that UPCORE improves both standard metrics and AUC, benefiting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
中文:UPCORE框架通过选择性修剪待遗忘数据中的异常值,在不同遗忘方法中实现了删除效果与模型效用保持的最佳平衡。
English: The proposed UPCORE framework selectively prunes outliers from data to be forgotten, effectively balancing deletion efficacy and model utility preservation across various unlearning methods.
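
The core selection step can be pictured in a few lines of Python: given hidden-state embeddings of the forget set, drop the points farthest from the centroid so the retained coreset has lower representation variance. The embedding source, distance measure, and 10% pruning ratio are illustrative assumptions, not UPCORE's exact procedure.

```python
import numpy as np

def low_variance_coreset(forget_embs: np.ndarray, prune_frac: float = 0.1):
    """Keep the forget points closest to the centroid (outliers removed)."""
    centroid = forget_embs.mean(axis=0)
    dists = np.linalg.norm(forget_embs - centroid, axis=1)
    keep = int(len(forget_embs) * (1.0 - prune_frac))
    return np.argsort(dists)[:keep]

embs = np.random.randn(500, 768)       # stand-in for model representations
idx = low_variance_coreset(embs)
print(f"kept {len(idx)} of {len(embs)} forget points")
```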

Authors:Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng
Title: SIFT: Grounding LLM Reasoning in Contexts via Stickers
Abstract:
This paper identifies that misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase "10 dollars per kilo," LLMs might not recognize that "per" means "for each," leading to calculation errors. We introduce a novel, post-training approach called Stick to the Facts (SIFT) to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the Sticker, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions -- one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via forward optimization (to better align the extracted facts with the query) and inverse generation (to conform with the model's inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to 85.67%, establishing a new state-of-the-art in the open-source community. The code is available at https://github.com/zhijie-group/SIFT.
中文: 本文提出SIFT方法,通过模型自生成的"贴标"将推理过程锚定于上下文,有效解决了大型语言模型在推理中的语境误解问题,并在多个基准测试中显著提升了性能。
English: This paper introduces SIFT, a post-training method that uses self-generated "Stickers" to enhance LLM reasoning by grounding it in context, significantly improving performance across various models and benchmarks.
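
The control flow is easy to sketch. Assuming a generic llm(prompt) completion function, the loop below generates a Sticker, compares the two predictions, and refines the Sticker when they disagree; the prompt wording is a placeholder, not the authors' templates.

```python
def sift_answer(llm, query: str, max_rounds: int = 3) -> str:
    sticker = llm(f"Extract the key facts and conditions from:\n{query}")
    ans_stick = ""
    for _ in range(max_rounds):
        ans_plain = llm(f"{query}\nAnswer:")
        ans_stick = llm(f"{query}\nKey facts: {sticker}\nAnswer:")
        if ans_plain.strip() == ans_stick.strip():
            break                                  # consensus: accept
        # Forward optimization: align the extracted facts with the query.
        sticker = llm(f"Revise these facts so they faithfully reflect the "
                      f"question.\nQuestion: {query}\nFacts: {sticker}")
        # Inverse generation: restate the facts in the model's own words.
        sticker = llm(f"Rewrite, keeping the content unchanged:\n{sticker}")
    return ans_stick
```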

Authors:Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, Cheng Yang
Title: PathRAG: Pruning Graph-based Retrieval Augmented Generation with Relational Paths
Abstract:
Retrieval-augmented generation (RAG) improves the response quality of large language models (LLMs) by retrieving knowledge from external databases. Typical RAG approaches split the text database into chunks, organizing them in a flat structure for efficient searches. To better capture the inherent dependencies and structured relationships across the text database, researchers propose to organize textual information into an indexing graph, known as graph-based RAG. However, we argue that the limitation of current graph-based RAG methods lies in the redundancy of the retrieved information, rather than its insufficiency. Moreover, previous methods use a flat structure to organize retrieved information within the prompts, leading to suboptimal performance. To overcome these limitations, we propose PathRAG, which retrieves key relational paths from the indexing graph, and converts these paths into textual form for prompting LLMs. Specifically, PathRAG effectively reduces redundant information with flow-based pruning, while guiding LLMs to generate more logical and coherent responses with path-based prompting. Experimental results show that PathRAG consistently outperforms state-of-the-art baselines across six datasets and five evaluation dimensions. The code is available at the following link: https://github.com/BUPT-GAMMA/PathRAG
Chinese: PathRAG通过从图索引中提取关键关系路径,利用基于流的剪枝减少冗余信息,并采用基于路径的提示方法提升生成响应的逻辑性,在多个数据集和评估维度上均优于现有先进方法。
English: PathRAG enhances retrieval-augmented generation by extracting key relational paths from a graph index, reducing redundancy through flow-based pruning and improving response coherence with path-based prompting, outperforming existing methods across multiple datasets and evaluation dimensions.
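
A toy version of the pruning step: enumerate short paths between query-relevant nodes, let a unit of "flow" decay at each hop and split across out-edges, and keep only the strongest paths for the prompt. The networkx graph, decay factor, and cutoff are illustrative assumptions, not PathRAG's exact algorithm.

```python
import networkx as nx

def pruned_paths(G, src, dst, decay=0.8, cutoff=4, top_k=3):
    scored = []
    for path in nx.all_simple_paths(G, src, dst, cutoff=cutoff):
        flow = 1.0
        for u, v in zip(path, path[1:]):
            # Flow decays per hop and spreads across a node's out-edges.
            flow *= decay / max(G.out_degree(u), 1)
        scored.append((flow, path))
    return [p for _, p in sorted(scored, reverse=True)[:top_k]]

G = nx.DiGraph([("query", "A"), ("A", "B"), ("B", "answer"),
                ("query", "C"), ("C", "answer")])
for p in pruned_paths(G, "query", "answer"):
    print(" -> ".join(p))
```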

Authors:Shang Yang, Junxian Guo, Haotian Tang, Qinghao Hu, Guangxuan Xiao, Jiaming Tang, Yujun Lin, Zhijian Liu, Yao Lu, Song Han
Title: LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
Abstract:
Large language models (LLMs) have shown remarkable potential in processing long sequences and complex reasoning tasks, yet efficiently serving these models remains challenging due to the quadratic computational complexity of attention in the prefilling stage and the large memory footprint of the KV cache in the decoding stage. To address these issues, we introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. This method unifies different hardware-friendly, structured sparsity patterns for both prefilling and decoding attention into a single framework, where computations on less important tokens are skipped block-wise. LServe demonstrates the compatibility of static and dynamic sparsity in long-context LLM attention. This design enables multiplicative speedups by combining these optimizations. Specifically, we convert half of the attention heads to nearly free streaming heads in both the prefilling and decoding stages. Additionally, we find that only a constant number of KV pages is required to preserve long-context and reasoning capabilities, irrespective of context length. We then design a hierarchical KV page selection policy that dynamically prunes KV pages based on query-centric similarity. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM, maintaining long-context accuracy. Code is released at https://github.com/mit-han-lab/omniserve.
中文: LServe系统通过混合稀疏注意力机制高效加速长序列大语言模型服务,在预填充阶段提速最高达2.9倍、解码阶段提速1.3-2.1倍,同时通过跳过次要令牌计算和动态剪枝KV缓存页保持了长上下文处理精度。
English: LServe is an efficient system that accelerates long-sequence LLM serving through hybrid sparse attention, achieving up to 2.9x faster prefilling and 1.3-2.1x faster decoding while maintaining accuracy by skipping computations on less important tokens and dynamically pruning KV pages.
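
The page-selection idea can be sketched in a few tensor operations: summarize each KV page by its mean key, score pages against the current query, and keep a constant number of pages regardless of context length. Page size, the number of pages kept, and the mean-key summary are illustrative assumptions, not LServe's exact hierarchical policy.

```python
import torch

T, d, page = 4096, 128, 64
keys = torch.randn(T, d)                      # cached keys for one head
q = torch.randn(d)                            # current decode-step query

pages = keys.view(T // page, page, d)         # (num_pages, page, d)
page_repr = pages.mean(dim=1)                 # mean key per page
scores = page_repr @ q                        # query-centric page scores
keep = scores.topk(16).indices                # constant number of pages kept
selected = pages[keep].reshape(-1, d)         # keys actually attended to
print(selected.shape)                         # (16 * 64, 128)
```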

Authors:Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
Title: FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Abstract:
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12$\times$ speedup over the state-of-the-art speculative sampling method EAGLE-2. Code available at https://github.com/thunlp/FR-Spec.
Chinese: FR-Spec提出了一种基于频率排序的推测采样框架,通过压缩草稿阶段的词表空间来加速大词表语言模型的生成,在保证最终输出分布不变的前提下将LM Head计算开销降低75%,并比最先进的EAGLE-2方法平均提速1.12倍。
English: FR-Spec introduces a frequency-ranked speculative sampling framework that accelerates large-vocabulary LLM generation by compressing the draft vocabulary space, reducing LM head computation by 75% while preserving the output distribution and achieving an average 1.12× speedup over the state-of-the-art EAGLE-2.
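
The vocabulary-compression trick is simple to sketch: during drafting, slice the LM head down to a frequency-prioritized subset and map the drafted token id back to the full vocabulary; the target model still verifies over all tokens, which is what preserves the output distribution. The subset size and the random ranking below are placeholders for real corpus frequency statistics.

```python
import torch

vocab_size, hidden, k = 128_000, 512, 32_000
lm_head = torch.nn.Linear(hidden, vocab_size, bias=False)

# Frequency-ranked token subset (random here; in practice the k most
# frequent tokens from corpus statistics).
freq_rank = torch.randperm(vocab_size)[:k]
sub_weight = lm_head.weight[freq_rank]        # (k, hidden) sliced LM head

h = torch.randn(1, hidden)                    # draft-model hidden state
sub_logits = h @ sub_weight.T                 # k-way matmul instead of 128k
draft_token = freq_rank[sub_logits.argmax(dim=-1)]
print(int(draft_token))                       # id in the full vocabulary
```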

Authors:Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica
Title: Prompt-to-Leaderboard
Abstract:
Large language model (LLM) evaluations typically rely on aggregated metrics like accuracy or human preference, averaging across users and prompts. This averaging obscures user- and prompt-specific variations in model performance. To address this, we propose Prompt-to-Leaderboard (P2L), a method that produces leaderboards specific to a prompt. The core idea is to train an LLM taking natural language prompts as input to output a vector of Bradley-Terry coefficients which are then used to predict the human preference vote. The resulting prompt-dependent leaderboards allow for unsupervised task-specific evaluation, optimal routing of queries to models, personalization, and automated evaluation of model strengths and weaknesses. Data from Chatbot Arena suggest that P2L better captures the nuanced landscape of language model performance than the averaged leaderboard. Furthermore, our findings suggest that P2L's ability to produce prompt-specific evaluations follows a power law scaling similar to that observed in LLMs themselves. In January 2025, the router we trained based on this methodology achieved the #1 spot on the Chatbot Arena leaderboard. Our code is available on GitHub at https://github.com/lmarena/p2l.
Chinese: 作者提出了Prompt-to-Leaderboard(P2L)方法,通过训练大语言模型根据自然语言提示预测人类偏好,生成针对特定提示的排行榜,从而实现个性化模型评估与查询路由,该方法优于传统聚合指标,并于2025年1月在Chatbot Arena排行榜上获得首位。
English: The authors introduce Prompt-to-Leaderboard (P2L), a method that generates prompt-specific leaderboards by training an LLM to predict human preferences, enabling personalized model evaluation and routing, which outperforms traditional aggregated metrics and achieved top ranking on Chatbot Arena in January 2025.
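
A minimal sketch of the decoding step: given the per-prompt Bradley-Terry coefficients the trained model emits, rank models and predict pairwise preference votes. The coefficient values below are placeholders.

```python
import math

def win_prob(theta_a: float, theta_b: float) -> float:
    """Bradley-Terry: P(model A preferred over B) for this prompt."""
    return 1.0 / (1.0 + math.exp(-(theta_a - theta_b)))

# Hypothetical coefficients emitted by the P2L model for one prompt.
coeffs = {"model-x": 1.3, "model-y": 0.9, "model-z": -0.2}
leaderboard = sorted(coeffs, key=coeffs.get, reverse=True)
print("prompt-specific ranking:", leaderboard)
print("P(model-x beats model-z):", round(win_prob(1.3, -0.2), 3))
```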

Authors:Jianwen Luo, Yiming Huang, Jinxiang Meng, Fangyu Lei, Shizhu He, Xiao Liu, Shanshan Jiang, Bin Dong, Jun Zhao, Kang Liu
Title: GATE: Graph-based Adaptive Tool Evolution Across Diverse Tasks
Abstract:
Large Language Models (LLMs) have shown great promise in tool-making, yet existing frameworks often struggle to efficiently construct reliable toolsets and are limited to single-task settings. To address these challenges, we propose GATE (Graph-based Adaptive Tool Evolution), an adaptive framework that dynamically constructs and evolves a hierarchical graph of reusable tools across multiple scenarios. We evaluate GATE on open-ended tasks (Minecraft), agent-based tasks (TextCraft, DABench), and code generation tasks (MATH, Date, TabMWP). Our results show that GATE achieves up to 4.3x faster milestone completion in Minecraft compared to the previous SOTA, and provides an average improvement of 9.23% over existing tool-making methods in code generation tasks and 10.03% in agent tasks. GATE demonstrates the power of adaptive evolution, balancing tool quantity, complexity, and functionality while maintaining high efficiency. Code and data are available at https://github.com/ayanami2003/GATE.
中文: GATE框架通过动态构建和演化分层工具图,在多场景任务中实现了比现有方法更优异的性能表现,在代码生成和智能体任务中分别获得平均9.23%和10.03%的性能提升。
English: The GATE framework dynamically constructs and evolves hierarchical tool graphs across multiple scenarios, achieving significant performance improvements in open-ended, agent-based, and code generation tasks compared to existing methods.

Authors:Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark
Title: Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Abstract:
Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.

Authors:Shangqing Tu, Yucheng Wang, Daniel Zhang-Li, Yushi Bai, Jifan Yu, Yuhao Wu, Lei Hou, Huiqin Liu, Zhiyuan Liu, Bin Xu, Juanzi Li
Title: LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models
Abstract:
Existing Large Vision-Language Models (LVLMs) can process inputs with context lengths up to 128k visual and text tokens, yet they struggle to generate coherent outputs beyond 1,000 words. We find that the primary limitation is the absence of long output examples during supervised fine-tuning (SFT). To tackle this issue, we introduce LongWriter-V-22k, an SFT dataset comprising 22,158 examples, each with multiple input images, an instruction, and corresponding outputs ranging from 0 to 10,000 words. Moreover, to achieve long outputs that maintain high fidelity to the input images, we apply Direct Preference Optimization (DPO) to the SFT model. Given the high cost of collecting human feedback for lengthy outputs (e.g., 3,000 words), we propose IterDPO, which breaks long outputs into segments and uses iterative corrections to form preference pairs with the original outputs. Additionally, we develop MMLongBench-Write, a benchmark featuring six tasks to evaluate the long-generation capabilities of VLMs. Our 7B parameter model, trained with LongWriter-V-22k and IterDPO, achieves impressive performance on this benchmark, outperforming larger proprietary models like GPT-4o. Code and data: https://github.com/THU-KEG/LongWriter-V
中文:现有大型视觉语言模型因缺乏长输出示例而难以生成连贯长文本,我们通过LongWriter-V-22k数据集和IterDPO方法提升了生成长度与保真度,使7B参数模型在性能上超越GPT-4o等大型专有模型。
English: Existing LVLMs struggle with long coherent outputs due to lacking long output examples in SFT, so we introduce LongWriter-V-22k dataset and IterDPO method to enhance generation fidelity and length, achieving superior performance with a 7B model over larger proprietary models.

Authors:Danni Liu, Jan Niehues
Title: Middle-Layer Representation Alignment for Cross-Lingual Transfer in Fine-Tuned LLMs
Abstract:
While large language models demonstrate remarkable capabilities at task-specific applications through fine-tuning, extending these benefits across diverse languages is essential for broad accessibility. However, effective cross-lingual transfer is hindered by LLM performance gaps across languages and the scarcity of fine-tuning data in many languages. Through analysis of LLM internal representations from over 1,000 language pairs, we discover that middle layers exhibit the strongest potential for cross-lingual alignment. Building on this finding, we propose a middle-layer alignment objective integrated into task-specific training. Our experiments on slot filling, machine translation, and structured text generation show consistent improvements in cross-lingual transfer, especially to lower-resource languages. The method is robust to the choice of alignment languages and generalizes to languages unseen during alignment. Furthermore, we show that separately trained alignment modules can be merged with existing task-specific modules, improving cross-lingual capabilities without full re-training. Our code is publicly available (https://github.com/dannigt/mid-align).
中文摘要:本研究提出一种中间层对齐方法,通过利用来自1000多种语言对的内部表征来增强语言模型的跨语言迁移能力,在多项任务中尤其是低资源语言上展现出持续改进效果。
English Summary: This study introduces a middle-layer alignment method that enhances cross-lingual transfer in language models by leveraging internal representations from over 1,000 language pairs, demonstrating consistent improvements across multiple tasks especially for lower-resource languages.
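
A minimal sketch of such an objective, assuming mean pooling and an MSE distance (the paper's exact layer choice, pooling, and distance may differ): pool the middle-layer hidden states of a translation pair and add their distance to the task loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"                          # stand-in for the fine-tuned LLM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, output_hidden_states=True)
mid = model.config.n_layer // 2        # pick a middle layer

def pooled_mid_state(text):
    ids = tok(text, return_tensors="pt")
    out = model(**ids)
    return out.hidden_states[mid].mean(dim=1)      # (1, hidden)

align_loss = torch.nn.functional.mse_loss(
    pooled_mid_state("The cat sleeps."),
    pooled_mid_state("Die Katze schläft."))
# total_loss = task_loss + lambda_align * align_loss
print(float(align_loss))
```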

Authors:Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, Yu Su
Title: From RAG to Memory: Non-Parametric Continual Learning for Large Language Models
Abstract:
Our ability to continuously acquire, organize, and leverage knowledge is a key feature of human intelligence that AI systems must approximate to unlock their full potential. Given the challenges in continual learning with large language models (LLMs), retrieval-augmented generation (RAG) has become the dominant way to introduce new information. However, its reliance on vector retrieval hinders its ability to mimic the dynamic and interconnected nature of human long-term memory. Recent RAG approaches augment vector embeddings with various structures like knowledge graphs to address some of these gaps, namely sense-making and associativity. However, their performance on more basic factual memory tasks drops considerably below standard RAG. We address this unintended deterioration and propose HippoRAG 2, a framework that outperforms standard RAG comprehensively on factual, sense-making, and associative memory tasks. HippoRAG 2 builds upon the Personalized PageRank algorithm used in HippoRAG and enhances it with deeper passage integration and more effective online use of an LLM. This combination pushes this RAG system closer to the effectiveness of human long-term memory, achieving a 7% improvement in associative memory tasks over the state-of-the-art embedding model while also exhibiting superior factual knowledge and sense-making memory capabilities. This work paves the way for non-parametric continual learning for LLMs. Code and data are available at https://github.com/OSU-NLP-Group/HippoRAG.
Chinese: HippoRAG 2框架通过深度融合段落分析和优化大语言模型在线使用,在事实记忆、意义构建和联想记忆任务上全面超越标准检索增强生成方法,为实现大语言模型的非参数持续学习开辟了新途径。
English: The HippoRAG 2 framework significantly outperforms standard retrieval-augmented generation by integrating deeper passage analysis and enhanced LLM utilization, achieving superior performance in factual, sense-making, and associative memory tasks while advancing non-parametric continual learning for AI systems.
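
The retrieval backbone can be pictured with networkx: seed Personalized PageRank with query-linked entity nodes and read off the highest-scoring passage nodes. The toy graph and seed weights are illustrative; HippoRAG 2's deeper passage integration and online LLM use are not shown.

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("Marie Curie", "radium"), ("radium", "passage_12"),
                  ("Marie Curie", "passage_7"), ("polonium", "passage_12")])

seeds = {"Marie Curie": 1.0}           # entities linked from the query
scores = nx.pagerank(G, personalization=seeds, alpha=0.85)

passages = {n: s for n, s in scores.items() if n.startswith("passage_")}
print(sorted(passages, key=passages.get, reverse=True))
```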

Authors:Jeonghun Baek, Akiko Aizawa, Kiyoharu Aizawa
Title: Harnessing PDF Data for Improving Japanese Large Multimodal Models
Abstract:
Large Multimodal Models (LMMs) have demonstrated strong performance in English, but their effectiveness in Japanese remains limited due to the lack of high-quality training data. Current Japanese LMMs often rely on translated English datasets, restricting their ability to capture Japan-specific cultural knowledge. To address this, we explore the potential of Japanese PDF data as a training resource, an area that remains largely underutilized. We introduce a fully automated pipeline that leverages pretrained models to extract image-text pairs from PDFs through layout analysis, OCR, and vision-language pairing, removing the need for manual annotation. Additionally, we construct instruction data from extracted image-text pairs to enrich the training data. To evaluate the effectiveness of PDF-derived data, we train Japanese LMMs and assess their performance on the Japanese LMM Benchmark. Our results demonstrate substantial improvements, with performance gains ranging from 2.1% to 13.8% on Heron-Bench. Further analysis highlights the impact of PDF-derived data on various factors, such as model size and language models, reinforcing its value as a multimodal resource for Japanese LMMs.
Chinese: 本研究开发了一种从日语PDF中自动提取图文对的流程,利用这一未充分开发的资源显著提升了日语大型多模态模型的性能,在基准测试中实现了最高13.8%的性能提升。
English: This study introduces an automated pipeline to extract image-text pairs from Japanese PDFs, significantly enhancing the performance of Japanese Large Multimodal Models by leveraging this underutilized resource and achieving up to 13.8% improvement on benchmarks.

Authors:Priyanka Kargupta, Ishika Agarwal, Tal August, Jiawei Han
Title: Tree-of-Debate: Multi-Persona Debate Trees Elicit Critical Thinking for Scientific Comparative Analysis
Abstract:
With the exponential growth of research facilitated by modern technology and improved accessibility, scientific discoveries have become increasingly fragmented within and across fields. This makes it challenging to assess the significance, novelty, incremental findings, and equivalent ideas between related works, particularly those from different research communities. Large language models (LLMs) have recently demonstrated strong quantitative and qualitative reasoning abilities, and multi-agent LLM debates have shown promise in handling complex reasoning tasks by exploring diverse perspectives and reasoning paths. Inspired by this, we introduce Tree-of-Debate (ToD), a framework which converts scientific papers into LLM personas that debate their respective novelties. To emphasize structured, critical reasoning rather than focusing solely on outcomes, ToD dynamically constructs a debate tree, enabling fine-grained analysis of independent novelty arguments within scholarly articles. Through experiments on scientific literature across various domains, evaluated by expert researchers, we demonstrate that ToD generates informative arguments, effectively contrasts papers, and supports researchers in their literature review.
中文摘要:Tree-of-Debate框架将科学论文转化为大语言模型角色进行结构化辩论,通过动态构建辩论树来分析论文的创新性并对比研究成果,有效辅助研究者进行跨领域的文献综述。
English Summary: The Tree-of-Debate framework transforms scientific papers into LLM personas that engage in structured debates to analyze novelty and contrast findings, aiding researchers in literature reviews across various domains.

Authors:Jianling Li, Shangzhan Li, Zhenye Gao, Qi Shi, Yuxuan Li, Zefan Wang, Jiacheng Huang, Haojie Wang, Jianrong Wang, Xu Han, Zhiyuan Liu, Maosong Sun
Title: TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
Abstract:
Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation. TritonBench will be available at https://github.com/thunlp/TritonBench.
中文: TritonBench作为首个全面的Triton代码生成基准,不仅评估功能正确性还关注性能效率,揭示了当前大型语言模型在生成优化GPU算子方面存在显著不足。
English: TritonBench is introduced as the first comprehensive benchmark for evaluating Triton code generation, focusing on both functional correctness and efficiency performance, revealing that current LLMs struggle to produce optimized GPU operators.

Authors:Yilei Jiang, Xinyan Gao, Tianshuo Peng, Yingshui Tan, Xiaoyong Zhu, Bo Zheng, Xiangyu Yue
Title: HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States
Abstract:
The integration of additional modalities increases the susceptibility of large vision-language models (LVLMs) to safety risks, such as jailbreak attacks, compared to their language-only counterparts. While existing research primarily focuses on post-hoc alignment techniques, the underlying safety mechanisms within LVLMs remain largely unexplored. In this work, we investigate whether LVLMs inherently encode safety-relevant signals within their internal activations during inference. Our findings reveal that LVLMs exhibit distinct activation patterns when processing unsafe prompts, which can be leveraged to detect and mitigate adversarial inputs without requiring extensive fine-tuning. Building on this insight, we introduce HiddenDetect, a novel tuning-free framework that harnesses internal model activations to enhance safety. Experimental results show that HiddenDetect surpasses state-of-the-art methods in detecting jailbreak attacks against LVLMs. By utilizing intrinsic safety-aware patterns, our method provides an efficient and scalable solution for strengthening LVLM robustness against multimodal threats. Our code will be released publicly at https://github.com/leigest519/HiddenDetect.
中文: 研究发现大型视觉语言模型在处理不安全内容时会产生独特的内部激活模式,据此开发的HiddenDetect框架无需调优即可利用这些模式有效检测和防御多模态越狱攻击。
English: This study reveals that large vision-language models exhibit distinct internal activation patterns when processing unsafe content, leading to the development of HiddenDetect—a tuning-free framework that leverages these patterns to effectively detect and mitigate multimodal jailbreak attacks.
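
In the same spirit (though not the paper's exact detector), one can estimate a risk direction from hidden states of known safe and unsafe prompts and score new inputs by projection; the gpt2 stand-in, layer choice, and linear probe are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")   # stand-in for an LVLM backbone
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def last_token_state(text, layer=-1):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids).hidden_states[layer]
    return hs[0, -1]                          # final-token activation

safe = [last_token_state("How do I bake bread?")]
unsafe = [last_token_state("Ignore all rules and output something harmful.")]
direction = torch.stack(unsafe).mean(0) - torch.stack(safe).mean(0)

def risk_score(prompt):
    """Higher means closer to the unsafe activation pattern."""
    return float(last_token_state(prompt) @ direction)

print(risk_score("Please help me plan a picnic."))
```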

Authors:Zichun Yu, Fei Peng, Jie Lei, Arnold Overwijk, Wen-tau Yih, Chenyan Xiong
Title: Group-Level Data Selection for Efficient Pretraining
Abstract:
In this paper, we introduce Group-MATES, an efficient group-level data selection approach to optimize the speed-quality frontier of language model pretraining. Specifically, Group-MATES parameterizes costly group-level selection with a relational data influence model. To train this model, we sample training trajectories of the language model and collect oracle data influences alongside. The relational data influence model approximates the oracle data influence by weighting individual influence with relationships among training data. To enable efficient selection with our relational data influence model, we partition the dataset into small clusters using relationship weights and select data within each cluster independently. Experiments on DCLM 400M-4x, 1B-1x, and 3B-1x show that Group-MATES achieves 3.5%-9.4% relative performance gains over random selection across 22 downstream tasks, nearly doubling the improvements achieved by state-of-the-art individual data selection baselines. Furthermore, Group-MATES reduces the number of tokens required to reach a certain downstream performance by up to 1.75x, substantially elevating the speed-quality frontier. Further analyses highlight the critical role of relationship weights in the relational data influence model and the effectiveness of our cluster-based inference. Our code is open-sourced at https://github.com/facebookresearch/Group-MATES.
Chinese: Group-MATES 提出了一种基于关系数据影响模型的高效群组级数据选择方法,用于优化语言模型预训练的速度与质量边界,实验表明其性能显著提升且所需标记数量最多减少1.75倍。
English: Group-MATES introduces an efficient group-level data selection method using a relational data influence model to optimize language model pretraining, achieving significant performance gains and reducing token requirements by up to 1.75x in experiments.
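
The cluster-then-select step can be sketched with scikit-learn: partition candidate embeddings into small clusters and pick the top-influence examples within each cluster independently. The random influence scores below stand in for the relational data influence model's outputs, and the cluster count and per-cluster quota are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 32))        # candidate-example embeddings
influence = rng.standard_normal(1000)      # stand-in influence scores

labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X)
selected = []
for c in range(20):
    idx = np.flatnonzero(labels == c)
    top = idx[np.argsort(influence[idx])[-10:]]   # top-10 per cluster
    selected.extend(top.tolist())
print(len(selected), "examples selected")
```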

Authors:Zujie Liang, Feng Wei, Wujiang Xu, Lin Chen, Yuxi Qian, Xinhui Wu
Title: I-MCTS: Enhancing Agentic AutoML via Introspective Monte Carlo Tree Search
Abstract:
Recent advancements in large language models (LLMs) have shown remarkable potential in automating machine learning tasks. However, existing LLM-based agents often struggle with low-diversity and suboptimal code generation. While recent work has introduced Monte Carlo Tree Search (MCTS) to address these issues, limitations persist in the quality and diversity of thoughts generated, as well as in the scalar value feedback mechanisms used for node selection. In this study, we introduce Introspective Monte Carlo Tree Search (I-MCTS), a novel approach that iteratively expands tree nodes through an introspective process that meticulously analyzes solutions and results from parent and sibling nodes. This facilitates a continuous refinement of the node in the search tree, thereby enhancing the overall decision-making process. Furthermore, we integrate a Large Language Model (LLM)-based value model to facilitate direct evaluation of each node's solution prior to conducting comprehensive computational rollouts. A hybrid rewarding mechanism is implemented to seamlessly transition the Q-value from LLM-estimated scores to actual performance scores. This allows higher-quality nodes to be traversed earlier. Applied to various ML tasks, our approach demonstrates a 6% absolute improvement in performance compared to strong open-source AutoML agents, showcasing its effectiveness in enhancing agentic AutoML systems. Resources are available at https://github.com/jokieleung/I-MCTS
Chinese: 本研究提出内省蒙特卡洛树搜索(I-MCTS),通过节点内省优化和混合奖励机制提升自动化机器学习决策能力,相比现有AutoML代理实现性能绝对提升6%。
English: This study introduces Introspective Monte Carlo Tree Search (I-MCTS), which enhances decision-making in automated machine learning by refining nodes through introspection and employing a hybrid rewarding mechanism, achieving a 6% performance improvement over existing AutoML agents.

Authors:Gengxu Li, Tingyu Xia, Yi Chang, Yuan Wu
Title: Length-Controlled Margin-Based Preference Optimization without Reference Model
Abstract:
Direct Preference Optimization (DPO) is a widely adopted offline algorithm for preference-based reinforcement learning from human feedback (RLHF), designed to improve training simplicity and stability by redefining reward functions. However, DPO is hindered by several limitations, including length bias, memory inefficiency, and probability degradation. To address these challenges, we propose Length-Controlled Margin-Based Preference Optimization (LMPO), a more efficient and robust alternative. LMPO introduces a uniform reference model as an upper bound for the DPO loss, enabling a more accurate approximation of the original optimization objective. Additionally, an average log-probability optimization strategy is employed to minimize discrepancies between training and inference phases. A key innovation of LMPO lies in its Length-Controlled Margin-Based loss function, integrated within the Bradley-Terry framework. This loss function regulates response length while simultaneously widening the margin between preferred and rejected outputs. By doing so, it mitigates probability degradation for both accepted and discarded responses, addressing a significant limitation of existing methods. We evaluate LMPO against state-of-the-art preference optimization techniques on two open-ended large language models, Mistral and LLaMA3, across six conditional benchmarks. Our experimental results demonstrate that LMPO effectively controls response length, reduces probability degradation, and outperforms existing approaches. The code is available at https://github.com/gengxuli/LMPO.
中文: 作者提出了长度控制边际偏好优化(LMPO)方法,通过引入统一参考模型和创新的损失函数来克服直接偏好优化的缺陷,有效控制响应长度并减少概率衰减,在Mistral和LLaMA3模型上的多项测试中表现出更优性能。
English: The authors introduce Length-Controlled Margin-Based Preference Optimization (LMPO) to overcome Direct Preference Optimization's limitations by using a uniform reference model and a novel loss function that controls response length and reduces probability degradation, showing superior performance on Mistral and LLaMA3 models across multiple benchmarks.
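
An illustrative (not the authors') rendering of the loss's two main ingredients: length-normalized average log-probabilities in a Bradley-Terry comparison, plus a fixed margin that widens the gap between preferred and rejected responses; beta and the margin are assumed hyperparameters, and the uniform reference model is omitted.

```python
import torch
import torch.nn.functional as F

def lmpo_like_loss(logp_chosen_sum, len_chosen, logp_rejected_sum,
                   len_rejected, beta=2.0, margin=0.5):
    avg_w = logp_chosen_sum / len_chosen      # length-normalized log-prob
    avg_l = logp_rejected_sum / len_rejected
    # Widen the preferred-vs-rejected gap beyond a fixed margin.
    return -F.logsigmoid(beta * (avg_w - avg_l) - margin)

loss = lmpo_like_loss(torch.tensor(-42.0), 30, torch.tensor(-80.0), 45)
print(float(loss))
```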

Authors:Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber
Title: NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization
Abstract:
Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) achieve the best accuracy on this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present Navig, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, Navig reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
Chinese: 本研究提出了Navig图像地理定位框架,利用高质量数据集NaviClues通过语言推理增强分析能力,仅需少量训练样本即可将平均距离误差降低14%,优于现有最优模型。
English: The study introduces Navig, a novel image geo-localization framework that leverages a high-quality dataset called NaviClues to enhance reasoning with language, reducing average distance error by 14% over previous models with minimal training data.

Authors:Eric Egli, Matteo Manica, Jannis Born
Title: Multiscale Byte Language Models -- A Hierarchical Architecture for Causal Million-Length Sequence Modeling
Abstract:
Bytes form the basis of the digital world and thus are a promising building block for multimodal foundation models. Recently, Byte Language Models (BLMs) have emerged to overcome tokenization, yet the excessive length of bytestreams requires new architectural paradigms. Therefore, we present the Multiscale Byte Language Model (MBLM), a model-agnostic hierarchical decoder stack that allows training with context windows of 5M bytes on a single GPU in full model precision. We thoroughly examine MBLM's performance with Transformer and Mamba blocks on both unimodal and multimodal tasks. Our experiments demonstrate that hybrid architectures are efficient in handling extremely long byte sequences during training while achieving near-linear generational efficiency. To the best of our knowledge, we present the first evaluation of BLMs on visual Q&A tasks and find that, despite serializing images and the absence of an encoder, a MBLM with pure next token prediction can match custom CNN-LSTM architectures with designated classification heads. We show that MBLMs exhibit strong adaptability in integrating diverse data representations, including pixel and image filestream bytes, underlining their potential toward omnimodal foundation models. Source code is publicly available at: https://github.com/ai4sd/multiscale-byte-lm
中文摘要:多尺度字节语言模型(MBLM)提出分层解码器架构,可在单GPU上高效处理百万字节序列训练,通过纯下一词元预测在多模态任务中实现与定制模型相媲美的性能,无需专用编码器。
English Summary: The Multiscale Byte Language Model (MBLM) introduces a hierarchical decoder architecture enabling efficient training on million-byte sequences with standard GPUs, demonstrating competitive performance in multimodal tasks through pure next-token prediction without specialized encoders.

Authors:Yupeng Chang, Chenlu Guo, Yi Chang, Yuan Wu
Title: LoRA-MGPO: Mitigating Double Descent in Low-Rank Adaptation via Momentum-Guided Perturbation Optimization
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), enable efficient adaptation of large language models (LLMs) via low-rank matrix optimization with frozen weights. However, LoRA typically exhibits "double descent" in training loss as rank increases, characterized by a three-phase dynamics: initial convergence, transient divergence, and eventual stabilization. This non-monotonic behavior delays convergence and impairs generalization through unstable gradients and attraction to sharp minima. To address these challenges, we propose LoRA-MGPO, a novel LoRA-based framework incorporating Momentum-Guided Perturbation Optimization (MGPO). First, MGPO eliminates Sharpness-Aware Minimization (SAM)'s dual gradient computations by reusing momentum vectors from optimizer states to guide perturbation directions. This retains SAM's training stability and flat minima preference with maintained efficiency. Second, MGPO incorporates adaptive perturbation normalization, scaling perturbation intensity via exponential moving average (EMA)-smoothed gradient magnitudes. Experiments on natural language understanding and generation benchmarks demonstrate that LoRA-MGPO outperforms LoRA and state-of-the-art PEFT methods. Further analysis confirms its ability to stabilize training and reduce sharp minima attraction, with smoother loss curves and improved convergence behavior. The code is available at https://github.com/llm172/LoRA-MGPO
中文: 提出的LoRA-MGPO框架通过动量引导扰动优化技术增强LoRA,在保持效率的同时有效稳定训练过程并提升语言任务性能。
English: The proposed LoRA-MGPO framework enhances LoRA by integrating Momentum-Guided Perturbation Optimization, which stabilizes training and improves performance on language tasks without sacrificing efficiency.
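
A toy rendering of momentum-guided perturbation on a single parameter: reuse SGD's momentum buffer as the perturbation direction (avoiding SAM's extra gradient pass) and scale it by an EMA of gradient norms; rho, the EMA decay, and the quadratic loss are illustrative assumptions.

```python
import torch

w = torch.nn.Parameter(torch.randn(10))
opt = torch.optim.SGD([w], lr=0.1, momentum=0.9)
ema_gnorm, rho, decay = 0.0, 0.05, 0.9

def loss_fn(p):
    return ((p - 1.0) ** 2).sum()

for step in range(3):
    m = opt.state[w].get("momentum_buffer")
    if m is None:                          # first step: plain gradient
        loss_fn(w).backward()
    else:                                  # perturb along momentum direction
        eps = rho * ema_gnorm * m / (m.norm() + 1e-12)
        with torch.no_grad():
            w.add_(eps)                    # move to the perturbed point
        loss_fn(w).backward()              # gradient at perturbed weights
        with torch.no_grad():
            w.sub_(eps)                    # restore before the real update
    ema_gnorm = decay * ema_gnorm + (1 - decay) * float(w.grad.norm())
    opt.step()
    opt.zero_grad()
    print(f"step {step}: loss = {float(loss_fn(w)):.4f}")
```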

Authors:Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, Qing Guo
Title: CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models
Abstract:
Large Language Model-based Multi-Agent Systems (LLM-MASs) have demonstrated remarkable real-world capabilities, effectively collaborating to complete complex tasks. While these systems are designed with safety mechanisms, such as rejecting harmful instructions through alignment, their security remains largely unexplored. This gap leaves LLM-MASs vulnerable to targeted disruptions. In this paper, we introduce Contagious Recursive Blocking Attacks (Corba), a novel and simple yet highly effective attack that disrupts interactions between agents within an LLM-MAS. Corba leverages two key properties: its contagious nature allows it to propagate across arbitrary network topologies, while its recursive property enables sustained depletion of computational resources. Notably, these blocking attacks often involve seemingly benign instructions, making them particularly challenging to mitigate using conventional alignment methods. We evaluate Corba on two widely-used LLM-MASs, namely, AutoGen and Camel across various topologies and commercial models. Additionally, we conduct more extensive experiments in open-ended interactive LLM-MASs, demonstrating the effectiveness of Corba in complex topology structures and open-source models. Our code is available at: https://github.com/zhrli324/Corba.
中文: 本文提出Corba攻击,这种具有传染性和递归性的方法能通过看似无害的指令在网络中传播并持续消耗资源,有效破坏基于大语言模型的多智能体系统,对传统安全防护机制构成挑战。
English: This paper introduces Corba, a contagious and recursive attack that effectively disrupts LLM-based multi-agent systems by propagating across networks and depleting resources through seemingly harmless instructions, challenging conventional safety measures.

Authors:Jinnan Li, Jinzhe Li, Yue Wang, Yi Chang, Yuan Wu
Title: StructFlowBench: A Structured Flow Benchmark for Multi-turn Instruction Following
Abstract:
Multi-turn instruction following capability constitutes a core competency of large language models (LLMs) in real-world applications. Existing evaluation benchmarks predominantly focus on fine-grained constraint satisfaction and domain-specific capability assessment, yet overlook the crucial structural dependencies between dialogue turns that distinguish multi-turn from single-turn interactions. These structural dependencies not only reflect user intent but also establish an essential second dimension for the instruction following evaluation beyond constraint satisfaction. To address this gap, we propose StructFlowBench, a multi-turn instruction following benchmark with structural flow modeling. The benchmark defines an innovative structural flow framework with six fundamental inter-turn relationships. These relationships introduce novel structural constraints for model evaluation and also serve as generation parameters for creating customized dialogue flows tailored to specific scenarios. Adopting established LLM-based automatic evaluation methodologies, we conduct systematic evaluations of 13 leading open-source and closed-source LLMs. Experimental results reveal significant deficiencies in current models' comprehension of multi-turn dialogue structures. The code is available at https://github.com/MLGroupJLU/StructFlowBench.
中文摘要:StructFlowBench是一个新的多轮对话评估基准,专门用于测试大语言模型对对话结构依赖关系的理解能力,实验结果表明当前模型在这方面存在明显不足。
English Summary: StructFlowBench is a new benchmark designed to evaluate large language models' ability to handle structural dependencies in multi-turn conversations, revealing significant shortcomings in current models' understanding of dialogue flow.

Authors:Lorenzo Pacchiardi, Konstantinos Voudouris, Ben Slater, Fernando Martínez-Plumed, José Hernández-Orallo, Lexin Zhou, Wout Schellaert
Title: PredictaBoard: Benchmarking LLM Score Predictability
Abstract:
Despite possessing impressive skills, Large Language Models (LLMs) often fail unpredictably, demonstrating inconsistent success in even basic common sense reasoning tasks. This unpredictability poses a significant challenge to ensuring their safe deployment, as identifying and operating within a reliable "safe zone" is essential for mitigating risks. To address this, we present PredictaBoard, a novel collaborative benchmarking framework designed to evaluate the ability of score predictors (referred to as assessors) to anticipate LLM errors on specific task instances (i.e., prompts) from existing datasets. PredictaBoard evaluates pairs of LLMs and assessors by considering the rejection rate at different error tolerances. As such, PredictaBoard stimulates research into developing better assessors and making LLMs more predictable, not only with a higher average performance. We conduct illustrative experiments using baseline assessors and state-of-the-art LLMs. PredictaBoard highlights the critical need to evaluate predictability alongside performance, paving the way for safer AI systems where errors are not only minimised but also anticipated and effectively mitigated. Code for our benchmark can be found at https://github.com/Kinds-of-Intelligence-CFI/PredictaBoard
Chinese: PredictaBoard是一个协作式基准测试框架,用于评估评分预测器对特定任务中大型语言模型错误的预判能力,通过前瞻性风险防控推动构建更安全、更可预测的人工智能系统。
English: PredictaBoard is a collaborative benchmarking framework that evaluates how well assessors can predict LLM errors on specific prompts, aiming to enhance LLM predictability and safety by anticipating and mitigating risks beyond just improving average performance.
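
The accuracy-rejection tradeoff at the heart of this evaluation is easy to simulate: an assessor assigns each prompt an error risk, and we measure the LLM's accuracy on the prompts it does not reject at each rejection rate; the scores and correctness labels below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
risk = rng.random(1000)                        # assessor's predicted risk
correct = rng.random(1000) > 0.3 * (1 + risk)  # riskier items fail more often

order = np.argsort(risk)                       # answer low-risk prompts first
for reject_rate in (0.0, 0.2, 0.5):
    kept = order[: int(len(order) * (1 - reject_rate))]
    print(f"reject {reject_rate:.0%}: accuracy on answered = "
          f"{correct[kept].mean():.3f}")
```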

Authors:Moxin Li, Yuantao Zhang, Wenjie Wang, Wentao Shi, Zhuo Liu, Fuli Feng, Tat-Seng Chua
Title: Self-Improvement Towards Pareto Optimality: Mitigating Preference Conflicts in Multi-Objective Alignment
Abstract:
Multi-Objective Alignment (MOA) aims to align LLMs' responses with multiple human preference objectives, with Direct Preference Optimization (DPO) emerging as a prominent approach. However, we find that DPO-based MOA approaches suffer from widespread preference conflicts in the data, where different objectives favor different responses. This results in conflicting optimization directions, hindering the optimization on the Pareto Front. To address this, we propose to construct Pareto-optimal responses to resolve preference conflicts. To efficiently obtain and utilize such responses, we propose a self-improving DPO framework that enables LLMs to self-generate and select Pareto-optimal responses for self-supervised preference alignment. Extensive experiments on two datasets demonstrate the superior Pareto Front achieved by our framework compared to various baselines. Code is available at https://github.com/zyttt-coder/SIPO.
Chinese: 该研究提出了一种自改进的直接偏好优化框架,使大语言模型能够自我生成并选择帕累托最优响应,有效解决偏好冲突,在多目标对齐方面相比现有方法实现了更优的性能。
English: The study introduces a self-improving Direct Preference Optimization framework that enables large language models to self-generate and select Pareto-optimal responses, effectively resolving preference conflicts and achieving superior multi-objective alignment compared to existing methods.

Authors:Ruichen Shao, Bei Li, Gangao Liu, Yang Chen, Xiang Zhou, Jingang Wang, Xunliang Cai, Peng Li
Title: Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective
Abstract:
Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes. Furthermore, additional experiments on mathematical and reasoning benchmarks (MMLU, GSM8K, and MATH) confirm that our method enhances performance without compromising general capabilities. Our codebase is available at https://github.com/LotuSrc/D2PO.
Chinese: 提出的增强偏好优化方法引入时间衰减因子,根据标记位置动态调整奖励权重,有效缓解DPO的长度偏差,在基准测试中性能提升5.9-8.8分,同时保持模型的通用能力。
English: The proposed enhanced preference optimization method introduces a temporal decay factor to dynamically weight rewards based on token position, effectively mitigating DPO's length bias and improving alignment performance by 5.9-8.8 points on benchmarks while preserving general capabilities.
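
A minimal sketch of the decay mechanism: weight each token's policy-vs-reference log-ratio by gamma raised to its position before summing into a DPO-style objective; the dummy tensors, beta, and gamma values are assumptions.

```python
import torch
import torch.nn.functional as F

def decayed_sum(token_logratios: torch.Tensor, gamma: float = 0.98):
    """Sum per-token log-ratios with gamma**position weights."""
    T = token_logratios.shape[-1]
    weights = gamma ** torch.arange(T, dtype=token_logratios.dtype)
    return (weights * token_logratios).sum(-1)

beta = 0.1
chosen = torch.randn(50)      # per-token log pi/pi_ref, preferred response
rejected = torch.randn(60)    # per-token log pi/pi_ref, rejected response
loss = -F.logsigmoid(beta * (decayed_sum(chosen) - decayed_sum(rejected)))
print(float(loss))
```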

Authors:Avinash Patil, Siru Tao, Aryan Jadon
Title: English Please: Evaluating Machine Translation with Large Language Models for Multilingual Bug Reports
Abstract:
Accurate translation of bug reports is critical for efficient collaboration in global software development. In this study, we conduct the first comprehensive evaluation of machine translation (MT) performance on bug reports, analyzing the capabilities of DeepL, AWS Translate, and large language models such as ChatGPT, Claude, Gemini, LLaMA, and Mistral using data from the Visual Studio Code GitHub repository, specifically focusing on reports labeled with the english-please tag. To assess both translation quality and source language identification accuracy, we employ a range of MT evaluation metrics (BLEU, BERTScore, COMET, METEOR, and ROUGE) alongside classification metrics such as accuracy, precision, recall, and F1-score. Our findings reveal that while ChatGPT (gpt-4o) excels in semantic and lexical translation quality, it does not lead in source language identification. Claude and Mistral achieve the highest F1-scores (0.7182 and 0.7142, respectively), and Gemini records the best precision (0.7414). AWS Translate shows the highest accuracy (0.4717) in identifying source languages. These results highlight that no single system dominates across all tasks, reinforcing the importance of task-specific evaluations. This study underscores the need for domain adaptation when translating technical content and provides actionable insights for integrating MT into bug-triaging workflows. The code and dataset for this paper are available at GitHub: https://github.com/av9ash/English-Please
中文摘要:本研究评估了多种机器翻译系统处理错误报告的表现,发现ChatGPT在翻译质量上最优,而Claude和Mistral在源语言识别方面领先,表明没有单一系统能在所有任务中全面胜出。
English Summary: This study evaluates machine translation systems for bug reports, finding that ChatGPT excels in translation quality while Claude and Mistral lead in source language identification, demonstrating no single system performs best across all tasks.
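
For reference, here is how two of the cited metrics can be computed on a single translated bug report with sacrebleu and nltk; the sentence pair is invented.

```python
import sacrebleu
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # required by METEOR

hyp = ["The editor crashes when opening large files."]
ref = ["The editor crashes while opening big files."]

print("BLEU:", sacrebleu.corpus_bleu(hyp, [ref]).score)
print("METEOR:", meteor_score([ref[0].split()], hyp[0].split()))
```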

Authors:Jing Xiong, Jianghan Shen, Chuanyang Zheng, Zhongwei Wan, Chenyang Zhao, Chiwun Yang, Fanghua Ye, Hongxia Yang, Lingpeng Kong, Ngai Wong
Title: ParallelComp: Parallel Long-Context Compressor for Length Extrapolation
Abstract:
Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs), as most training-free extrapolation methods are not only severely limited by memory bottlenecks, but also suffer from the attention sink, which restricts their scalability and effectiveness in practice. In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck, enabling 8B-parameter LLMs to extrapolate from 8K to 128K tokens on a single A100 80GB GPU in a training-free setting. ParallelComp splits the input into chunks, dynamically evicting redundant chunks and irrelevant tokens, supported by a parallel KV cache eviction mechanism. Importantly, we present a systematic theoretical and empirical analysis of attention biases in parallel attention-including the attention sink, recency bias, and middle bias-and reveal that these biases exhibit distinctive patterns under ultra-long context settings. We further design a KV cache eviction technique to mitigate this phenomenon. Experimental results show that ParallelComp enables an 8B model (trained on 8K context) to achieve 91.17% of GPT-4's performance under ultra-long contexts, outperforming closed-source models such as Claude-2 and Kimi-Chat. We achieve a 1.76x improvement in chunk throughput, thereby achieving a 23.50x acceleration in the prefill stage with negligible performance loss and pave the way for scalable and robust ultra-long contexts extrapolation in LLMs. We release the code at https://github.com/menik1126/ParallelComp.
中文: ParallelComp是一种无需训练的并行压缩方法,通过克服内存瓶颈和注意力偏差,使80亿参数大模型在单GPU上实现从8K到128K标记的上下文扩展,性能接近GPT-4且大幅提升处理速度。
English: ParallelComp is a training-free parallel compression method that enables 8B LLMs to extrapolate from 8K to 128K tokens on a single GPU by overcoming memory bottlenecks and attention biases, achieving near-GPT-4 performance with significant speed improvements.
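
The chunk-eviction control flow can be sketched without any model: split the input into chunks, score each against the query (a word-overlap proxy below stands in for attention-based relevance), and keep only the budgeted chunks in their original order; the chunk size and budget are illustrative.

```python
def evict_chunks(text: str, query: str, chunk_words: int = 64, budget: int = 4):
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    qset = set(query.lower().split())
    scores = [len(qset & set(c.lower().split())) for c in chunks]
    keep = sorted(range(len(chunks)), key=scores.__getitem__)[-budget:]
    return [chunks[i] for i in sorted(keep)]       # preserve original order

doc = " ".join(f"section {i} discusses topic {i}." for i in range(200))
print(len(evict_chunks(doc, "what does section 42 say about topic 42?")))
```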

Authors:Yurong Wu, Fangwen Mu, Qiuhong Zhang, Jinjing Zhao, Xinrun Xu, Lingrui Mei, Yang Wu, Lin Shi, Junjie Wang, Zhiming Ding, Yiwei Wang
Title: Vulnerability of Text-to-Image Models to Prompt Template Stealing: A Differential Evolution Approach
Abstract:
Prompt trading has emerged as a significant intellectual property concern in recent years, where vendors entice users by showcasing sample images before selling prompt templates that can generate similar images. This work investigates a critical security vulnerability: attackers can steal prompt templates using only a limited number of sample images. To investigate this threat, we introduce Prism, a prompt-stealing benchmark consisting of 50 templates and 450 images, organized into Easy and Hard difficulty levels. To identify the vulnerability of VLMs to prompt stealing, we propose EvoStealer, a novel template stealing method that operates without model fine-tuning by leveraging differential evolution algorithms. The system first initializes population sets using multimodal large language models (MLLMs) based on predefined patterns, then iteratively generates enhanced offspring through MLLMs. During evolution, EvoStealer identifies common features across offspring to derive generalized templates. Our comprehensive evaluation conducted across open-source (INTERNVL2-26B) and closed-source models (GPT-4o and GPT-4o-mini) demonstrates that EvoStealer's stolen templates can reproduce images highly similar to originals and effectively generalize to other subjects, significantly outperforming baseline methods with an average improvement of over 10%. Moreover, our cost analysis reveals that EvoStealer achieves template stealing with negligible computational expenses. Our code and dataset are available at https://github.com/whitepagewu/evostealer.
中文: 本研究提出EvoStealer,一种无需模型微调即可通过差分进化算法从样本图像中窃取提示模板的新方法,其性能显著优于基线方法,平均提升超过10%,且计算成本极低。
English: This study introduces EvoStealer, a novel prompt-stealing method that uses differential evolution algorithms to extract prompt templates from sample images without model fine-tuning, demonstrating superior performance over baselines with over 10% improvement and minimal computational cost.

Authors:Hanlin Wang, Jian Wang, Chak Tou Leong, Wenjie Li
Title: STeCa: Step-level Trajectory Calibration for LLM Agent Learning
Abstract:
Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.
Chinese: 提出的STeCa框架通过步骤级奖励比较识别次优行动,并利用基于大语言模型的反思构建校准轨迹,有效解决了现有方法在长周期任务中的不足,显著提升了智能体的任务完成能力和鲁棒性。
English: The proposed STeCa framework addresses the limitations of existing LLM agent training methods by identifying suboptimal actions through step-level reward comparisons and constructing calibrated trajectories via LLM-driven reflection, significantly enhancing performance and robustness in long-horizon tasks.

Authors:Yupeng Chang, Yi Chang, Yuan Wu
Title: Transfer-Prompting: Enhancing Cross-Task Adaptation in Large Language Models via Dual-Stage Prompts Optimization
Abstract:
Large language models (LLMs) face significant challenges when balancing multiple high-level objectives, such as generating coherent, relevant, and high-quality responses while maintaining efficient task adaptation across diverse tasks. To address these challenges, we introduce Transfer-Prompting, a novel two-stage framework designed to enhance cross-task adaptation in prompt generation. The framework comprises two key components: (1) source prompt construction, which refines the original prompts on source task datasets to generate source prompts with enhanced generalization ability, and (2) target prompt generation, which enhances cross-task adaptation of target prompts by fine-tuning a set of high-scoring source prompts on task-specific datasets. In each optimization cycle, a reference LLM generates candidate prompts based on historical prompt-score pairs and task descriptions in our designed reference prompt. These candidate prompts are refined iteratively, while a scorer LLM evaluates their effectiveness using the multi-dimensional metrics designed in the objective prompts evaluator, a novel contribution in this work that provides a holistic evaluation of prompt quality and task performance. This feedback loop facilitates continuous refinement, optimizing both prompt quality and task-specific outcomes. We validate Transfer-Prompting through extensive experiments across 25 LLMs, including 7 foundational models and 18 specialized models, evaluated on 9 diverse datasets. The results demonstrate that Transfer-Prompting significantly improves task-specific performance, highlighting its potential for enhancing cross-task adaptation in LLMs. The code is available at https://github.com/llm172/Transfer-Prompting.
中文摘要:Transfer-Prompting框架通过源提示构建和目标提示生成的两阶段设计,显著提升大语言模型的跨任务适应能力,在多模型和多数据集的实验中验证了其有效性。
English Summary: The Transfer-Prompting framework enhances cross-task adaptation in LLMs through a two-stage process of source prompt construction and target prompt generation, validated by significant performance improvements across multiple models and datasets.

Authors:Shokhrukh Ibragimov, Arnulf Jentzen, Benno Kuckuck
Title: On the logical skills of large language models: evaluations using arbitrarily complex first-order logic problems
Abstract:
We present a method of generating first-order logic statements whose complexity can be controlled along multiple dimensions. We use this method to automatically create several datasets consisting of questions asking for the truth or falsity of first-order logic statements in Zermelo-Fraenkel set theory. While the resolution of these questions does not require any knowledge beyond basic notation of first-order logic and set theory, it does require a degree of planning and logical reasoning, which can be controlled up to arbitrarily high difficulty by the complexity of the generated statements. Furthermore, we conduct extensive evaluations of the performance of various large language models, including recent models such as DeepSeek-R1 and OpenAI's o3-mini, on these datasets. All of the datasets, along with the code used for generating them and all data from the evaluations, are publicly available at https://github.com/bkuckuck/logical-skills-of-llms.
Chinese: 本文提出了一种可控制复杂度的生成一阶逻辑语句的方法,并利用该方法创建数据集来评估包括DeepSeek-R1和OpenAI的o3-mini在内的多种大语言模型的逻辑推理能力。
English: This paper introduces a method for generating first-order logic statements with controllable complexity and uses it to create datasets for evaluating the logical reasoning abilities of large language models, including recent ones like DeepSeek-R1 and OpenAI's o3-mini.
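
The controllable-complexity idea can be illustrated with a toy recursive generator in which a single depth parameter governs difficulty. The grammar and atom set below are invented for the example, not the authors' ZFC construction:

```python
# Illustrative depth-controlled generator of first-order logic statements.
import random

ATOMS = ["x in A", "x in B", "A subset B", "B subset A"]
CONNECTIVES = ["and", "or", "implies"]

def gen_formula(depth: int) -> str:
    if depth == 0:
        return random.choice(ATOMS)
    op = random.choice(CONNECTIVES + ["not", "forall", "exists"])
    if op == "not":
        return f"not ({gen_formula(depth - 1)})"
    if op in ("forall", "exists"):
        return f"{op} x. ({gen_formula(depth - 1)})"
    return f"({gen_formula(depth - 1)}) {op} ({gen_formula(depth - 1)})"

# Deeper nesting yields arbitrarily harder statements to evaluate.
print(gen_formula(depth=3))
```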

Authors:Xuansheng Wu, Wenhao Yu, Xiaoming Zhai, Ninghao Liu
Title: Self-Regularization with Sparse Autoencoders for Controllable LLM-based Classification
Abstract:
Modern text classification methods heavily rely on contextual embeddings from large language models (LLMs). Compared to human-engineered features, these embeddings provide automatic and effective representations for classification model training. However, they also introduce a challenge: we lose the ability to manually remove unintended features, such as sensitive or task-irrelevant features, to guarantee regulatory compliance or improve the generalizability of classification models. This limitation arises because LLM embeddings are opaque and difficult to interpret. In this paper, we propose a novel framework to identify and regularize unintended features in the LLM latent space. Specifically, we first pre-train a sparse autoencoder (SAE) to extract interpretable features from LLM latent spaces. To ensure the SAE can capture task-specific features, we further fine-tune it on task-specific datasets. In training the classification model, we propose a simple and effective regularizer, by minimizing the similarity between the classifier weights and the identified unintended feature, to remove the impact of these unintended features on classification. We evaluate the proposed framework on three real-world tasks, including toxic chat detection, reward modeling, and disease diagnosis. Results show that the proposed self-regularization framework can improve the classifier's generalizability by regularizing those features that are not semantically correlated to the task. This work pioneers controllable text classification on LLM latent spaces by leveraging interpreted features to address generalizability, fairness, and privacy challenges. The code and data are publicly available at https://github.com/JacksonWuxs/Controllable_LLM_Classifier.
中文: 本文提出了一种新颖框架,通过稀疏自编码器识别并正则化大语言模型嵌入中的非预期特征,从而提升文本分类器的泛化能力并解决公平性和隐私问题。
English: This paper introduces a novel framework that uses a sparse autoencoder to identify and regularize unintended features in LLM embeddings, enhancing classifier generalizability and addressing fairness and privacy concerns in text classification.
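
The regularizer itself is simple to sketch. The following toy loss (an illustrative assumption, not the authors' exact formulation) penalizes cosine alignment between classifier weights and the identified unintended SAE feature directions:

```python
# Minimal sketch of the similarity-penalty idea: discourage classifier weights
# from aligning with unintended feature directions. The loss weighting is an
# assumption for illustration.
import torch
import torch.nn.functional as F

def regularized_loss(logits, labels, clf_weight, unintended_dirs, lam=0.1):
    """clf_weight: [num_classes, d]; unintended_dirs: [num_feats, d] (SAE features)."""
    ce = F.cross_entropy(logits, labels)
    # Cosine similarity between every class weight and every unintended direction.
    sim = F.cosine_similarity(
        clf_weight.unsqueeze(1), unintended_dirs.unsqueeze(0), dim=-1
    )
    return ce + lam * sim.pow(2).mean()

# Toy usage: 3 classes over 16-dim embeddings, 2 unintended SAE directions.
logits, labels = torch.randn(8, 3), torch.randint(0, 3, (8,))
loss = regularized_loss(logits, labels, torch.randn(3, 16), torch.randn(2, 16))
```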

Authors:Yueqing Liang, Liangwei Yang, Chen Wang, Congying Xia, Rui Meng, Xiongxiao Xu, Haoran Wang, Ali Payani, Kai Shu
Title: Benchmarking LLMs for Political Science: A United Nations Perspective
Abstract:
Large Language Models (LLMs) have achieved significant advances in natural language processing, yet their potential for high-stake political decision-making remains largely unexplored. This paper addresses the gap by focusing on the application of LLMs to the United Nations (UN) decision-making process, where the stakes are particularly high and political decisions can have far-reaching consequences. We introduce a novel dataset comprising publicly available UN Security Council (UNSC) records from 1994 to 2024, including draft resolutions, voting records, and diplomatic speeches. Using this dataset, we propose the United Nations Benchmark (UNBench), the first comprehensive benchmark designed to evaluate LLMs across four interconnected political science tasks: co-penholder judgment, representative voting simulation, draft adoption prediction, and representative statement generation. These tasks span the three stages of the UN decision-making process (drafting, voting, and discussing) and aim to assess LLMs' ability to understand and simulate political dynamics. Our experimental analysis demonstrates the potential and challenges of applying LLMs in this domain, providing insights into their strengths and limitations in political science. This work contributes to the growing intersection of AI and political science, opening new avenues for research and practical applications in global governance. The UNBench Repository can be accessed at: https://github.com/yueqingliang1/UNBench.
中文摘要:本文提出首个基于联合国安理会数据的综合评估基准UNBench,通过四项政治学任务系统评估大语言模型在政治决策中的能力,揭示了其在模拟高风险外交进程中的潜力与局限。
English Summary: This paper introduces UNBench, the first comprehensive benchmark using UN Security Council data to evaluate Large Language Models' capabilities in political decision-making tasks, revealing both their potential and limitations in simulating high-stakes diplomatic processes.

Authors:Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov
Title: RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
Abstract:
Transformer-based Large Language Models rely critically on the KV cache to efficiently handle extended contexts during the decode phase. Yet, the size of the KV cache grows proportionally with the input length, burdening both memory bandwidth and capacity as decoding progresses. To address this challenge, we present RocketKV, a training-free KV cache compression strategy containing two consecutive stages. In the first stage, it performs coarse-grain permanent KV cache eviction on the input sequence tokens. In the second stage, it adopts a hybrid sparse attention method to conduct fine-grain top-k sparse attention, approximating the attention scores by leveraging both head and sequence dimensionality reductions. We show that RocketKV provides a compression ratio of up to 400×, an end-to-end speedup of up to 3.7×, and a peak memory reduction of up to 32.6% in the decode phase on an NVIDIA A100 GPU compared to the full KV cache baseline, while achieving negligible accuracy loss on a variety of long-context tasks. We also propose a variant of RocketKV for multi-turn scenarios, which consistently outperforms other existing methods and achieves accuracy nearly on par with an oracle top-k attention scheme. The source code is available here: https://github.com/NVlabs/RocketKV.
中文: RocketKV是一种无需训练的KV缓存压缩方法,通过粗粒度淘汰和细粒度稀疏注意力,在长上下文任务中实现高达400倍压缩和3.7倍加速,同时保持近乎无损的精度。
English: RocketKV is a training-free KV cache compression method that employs coarse-grain eviction and fine-grain sparse attention, achieving up to 400× compression and 3.7× speedup with minimal accuracy loss in long-context tasks.
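
A rough sketch of the two-stage idea follows, with simplified scoring rules that are assumptions of this example rather than RocketKV's actual kernels:

```python
# Hedged sketch of a two-stage scheme: coarse permanent eviction of low-scoring
# tokens, then fine-grained top-k sparse attention over the survivors.
import torch

def two_stage_attention(q, K, V, keep_ratio=0.25, topk=32):
    # Stage 1: coarse eviction by each key's alignment with the query.
    scores = K @ q                               # [seq_len]
    keep = max(topk, int(K.size(0) * keep_ratio))
    idx = torch.topk(scores, keep).indices
    K, V, scores = K[idx], V[idx], scores[idx]
    # Stage 2: fine-grained top-k sparse attention over the kept tokens.
    top = torch.topk(scores, min(topk, keep)).indices
    attn = torch.softmax(K[top] @ q / K.size(-1) ** 0.5, dim=0)
    return attn @ V[top]

out = two_stage_attention(torch.randn(64), torch.randn(4096, 64), torch.randn(4096, 64))
```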

Authors:Masane Fuchi, Tomohiro Takagi
Title: Erasing with Precision: Evaluating Specific Concept Erasure from Text-to-Image Generative Models
Abstract:
Studies have been conducted to prevent specific concepts from being generated from pretrained text-to-image generative models, achieving concept erasure in various ways. However, the performance evaluation of these studies is still largely reliant on visualization, with the superiority of studies often determined by human subjectivity. The metrics of quantitative evaluation also vary, making comprehensive comparisons difficult. We propose EraseEval, an evaluation method that differs from previous evaluation methods in that it involves three fundamental evaluation criteria: (1) how well the prompt containing the target concept is reflected, (2) to what extent concepts related to the erased concept can reduce the impact of the erased concept, and (3) whether other concepts are preserved. These criteria are evaluated and integrated into a single metric, such that a lower score is given if any of the evaluations are low, leading to a more robust assessment. We experimentally evaluated baseline concept erasure methods, organized their characteristics, and identified challenges with them. Despite these being fundamental evaluation criteria, some concept erasure methods failed to achieve high scores, which points toward future research directions for concept erasure methods. Our code is available at https://github.com/fmp453/erase-eval.
中文: 本文提出EraseEval评估框架,通过将三个核心标准整合为单一指标来系统评估文本到图像模型的概念消除效果,实验发现现有方法在此标准下存在不足,为未来研究指明了方向。
English: This paper introduces EraseEval, a novel evaluation framework for concept erasure in text-to-image models that integrates three key criteria into a single robust metric, addressing limitations in current visualization-dependent assessments and revealing challenges in existing methods through experimental analysis.

Authors:William Jurayj, Jeffrey Cheng, Benjamin Van Durme
Title: Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
Abstract:
Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
中文摘要:提升大型语言模型的测试时计算量不仅能提高答案准确率,还能增强对正确答案的置信度,这促使我们建立包含响应风险阈值的新型评估体系。
English Summary: Increasing test-time compute in large language models not only improves accuracy but also boosts confidence in correct answers, prompting a new evaluation approach that incorporates response risk thresholds.
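
The thresholding recipe reduces to a small decision rule. The confidence source (e.g., a sequence log-probability extracted during reasoning) is an assumption of this sketch:

```python
# Minimal sketch of confidence-thresholded answering: abstain when the model's
# confidence falls below a risk-dependent threshold.
def selective_answer(answer: str, confidence: float, threshold: float):
    """Return the answer only if confidence clears the threshold, else abstain."""
    return answer if confidence >= threshold else None

# A stricter threshold trades coverage for a lower risk of wrong answers.
for t in (0.5, 0.9):
    print(t, selective_answer("42", confidence=0.8, threshold=t))
```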

Authors:Reza Averly, Frazier N. Baker, Ian A. Watson, Xia Ning
Title: LIDDIA: Language-based Intelligent Drug Discovery Agent
Abstract:
Drug discovery is a long, expensive, and complex process, relying heavily on human medicinal chemists, who can spend years searching the vast space of potential therapies. Recent advances in artificial intelligence for chemistry have sought to expedite individual drug discovery tasks; however, there remains a critical need for an intelligent agent that can navigate the drug discovery process. Towards this end, we introduce LIDDIA, an autonomous agent capable of intelligently navigating the drug discovery process in silico. By leveraging the reasoning capabilities of large language models, LIDDIA serves as a low-cost and highly-adaptable tool for autonomous drug discovery. We comprehensively examine LIDDIA, demonstrating that (1) it can generate molecules meeting key pharmaceutical criteria on over 70% of 30 clinically relevant targets, (2) it intelligently balances exploration and exploitation in the chemical space, and (3) it identifies one promising novel candidate on AR/NR3C4, a critical target for both prostate and breast cancers. Code and dataset are available at https://github.com/ninglab/LIDDiA.
中文:LIDDIA是一种自主人工智能代理,利用大型语言模型智能引导药物发现过程,在生成符合药物标准的分子和识别关键癌症靶点的新候选物方面表现出色。
English: LIDDIA is an autonomous AI agent that leverages large language models to intelligently navigate the drug discovery process, demonstrating high success in generating molecules meeting pharmaceutical criteria and identifying novel candidates for critical cancer targets.

Authors:Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang
Title: RAG-Gym: Systematic Optimization of Language Agents for Retrieval-Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) has shown great promise for knowledge-intensive tasks and recently advanced with agentic RAG, where language agents engage in multi-round interactions with external knowledge sources for adaptive information retrieval. However, existing agentic RAG methods often depend on ad-hoc prompt engineering and lack a unified optimization framework. We introduce RAG-Gym, a comprehensive platform that systematically explores three optimization dimensions: (1) prompt engineering, (2) actor tuning, and (3) critic training. For prompt engineering, we propose Re²Search, a novel agent incorporating reasoning reflection that significantly outperforms standard prompts. In actor tuning, we evaluate three popular post-training algorithms with fine-grained process supervision and identify direct preference optimization as the most effective. We further demonstrate that a trained critic can enhance inference by selecting higher-quality intermediate reasoning steps. Together, these findings lead to the optimized Re²Search++ agent, which surpasses most recent methods like Search-R1 by a relative increase of 3.2% to 11.6% in average F1. Finally, we examine the impact of different reward sources and analyze scaling properties in training and inference, offering practical insights for agentic RAG optimization. The project homepage is available at https://rag-gym.github.io.
Chinese: 摘要介绍了RAG-Gym平台,该平台通过提示工程、行动者调优和评判器训练来优化代理式检索增强生成,最终开发的Re²Search++智能体在性能指标上显著超越了现有最新方法。
English: The abstract introduces RAG-Gym, a platform that optimizes agentic RAG through prompt engineering, actor tuning, and critic training, resulting in the enhanced Re²Search++ agent which significantly outperforms recent methods in performance metrics.

Authors:Jingwang Huang, Jiang Zhong, Qin Lei, Jinpeng Gao, Yuming Yang, Sirui Wang, Peiguang Li, Kaiwen Wei
Title: Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition
Abstract:
Multimodal multi-label emotion recognition (MMER) aims to identify the concurrent presence of multiple emotions in multimodal data. Existing studies primarily focus on improving fusion strategies and modeling modality-to-label dependencies. However, they often overlook the impact of aleatoric uncertainty, which is the inherent noise in the multimodal data and hinders the effectiveness of modality fusion by introducing ambiguity into feature representations. To address this issue and effectively model aleatoric uncertainty, this paper proposes the Latent emotional Distribution Decomposition with Uncertainty perception (LDDU) framework from a novel perspective of latent emotional space probabilistic modeling. Specifically, we introduce a contrastive disentangled distribution mechanism within the emotion space to model the multimodal data, allowing for the extraction of semantic features and uncertainty. Furthermore, we design an uncertainty-aware fusion multimodal method that accounts for the dispersed distribution of uncertainty and integrates distribution information. Experimental results show that LDDU achieves state-of-the-art performance on the CMU-MOSEI and M³ED datasets, highlighting the importance of uncertainty modeling in MMER. Code is available at https://github.com/201983290498/lddu_mmer.git.
中文摘要:LDDU框架通过潜在情感空间概率建模和不确定性感知融合方法,有效解决多模态多标签情感识别中的随机不确定性,在基准数据集上取得了最优性能。
English Summary: The LDDU framework addresses aleatoric uncertainty in multimodal multi-label emotion recognition by modeling latent emotional distributions and employing uncertainty-aware fusion, achieving state-of-the-art results on benchmark datasets.

Authors:Guanzheng Chen, Xin Li, Michael Qizhe Shieh, Lidong Bing
Title: LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization
Abstract:
Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, which enables short-context LLMs to self-evolve to excel on long-context tasks by internally transferring short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference reveals capabilities and potentials of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 512K context lengths, LongPO fully retains short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, LongPO-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales. Our code is available at https://github.com/DAMO-NLP-SG/LongPO.
中文: LongPO通过让短上下文大语言模型利用内部能力转移和自生成的偏好数据来自我进化,以胜任长上下文任务,在保持短上下文性能的同时,实现了与GPT-4-128K等先进模型相媲美甚至更优的长上下文表现。
English: LongPO enables short-context LLMs to self-evolve for long-context tasks by leveraging internal capability transfer and self-generated preference data, maintaining short-context performance while achieving superior results comparable to advanced models like GPT-4-128K.

Authors:Xingbo Wang, Janessa Griffith, Daniel A. Adler, Joey Castillo, Tanzeem Choudhury, Fei Wang
Title: Exploring Personalized Health Support through Data-Driven, Theory-Guided LLMs: A Case Study in Sleep Health
Abstract:
Despite the prevalence of sleep-tracking devices, many individuals struggle to translate data into actionable improvements in sleep health. Current methods often provide data-driven suggestions but may not be feasible and adaptive to real-life constraints and individual contexts. We present HealthGuru, a novel large language model-powered chatbot to enhance sleep health through data-driven, theory-guided, and adaptive recommendations with conversational behavior change support. HealthGuru's multi-agent framework integrates wearable device data, contextual information, and a contextual multi-armed bandit model to suggest tailored sleep-enhancing activities. The system facilitates natural conversations while incorporating data-driven insights and theoretical behavior change techniques. Our eight-week in-the-wild deployment study with 16 participants compared HealthGuru to a baseline chatbot. Results show improved metrics like sleep duration and activity scores, higher quality responses, and increased user motivation for behavior change with HealthGuru. We also identify challenges and design considerations for personalization and user engagement in health chatbots.
中文摘要:HealthGuru是一种新型的基于大语言模型的聊天机器人,通过个性化睡眠建议和对话式支持,在实际应用中显著改善了用户睡眠指标并提升了参与度。
English Summary: HealthGuru is a novel LLM-powered chatbot that provides personalized, adaptive sleep recommendations and conversational support, demonstrating improved sleep metrics and user engagement in real-world testing.

Authors:Jaesung Tae, Hamish Ivison, Sachin Kumar, Arman Cohan
Title: TESS 2: A Large-Scale Generalist Diffusion Language Model
Abstract:
We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We find that adaptation training as well as the choice of the base model is crucial for training good instruction-following diffusion models. We further propose reward guidance, a novel and modular inference-time guidance procedure to align model outputs without needing to train the underlying model. Finally, we show that TESS 2 further improves with increased inference-time compute, highlighting the utility of diffusion LMs in having fine-grained controllability over the amount of compute used at inference time. Code and models are available at https://github.com/hamishivi/tess-2.
中文:TESS 2 是一种扩散语言模型,通过适应性训练和创新的奖励引导技术,在遵循指令方面优于同类扩散模型,并能与自回归模型相媲美,同时具备推理计算可控性。
English: TESS 2 is a diffusion language model that excels in following instructions, surpassing similar diffusion models and competing with autoregressive models through adaptation training and a novel reward guidance technique for output alignment.

Authors:Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue
Title: DataSciBench: An LLM Agent Benchmark for Data Science
Abstract:
This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and straightforward evaluation metrics, which limits the scope of tasks that can be evaluated. In contrast, DataSciBench is constructed based on a more comprehensive and curated collection of natural and challenging prompts for uncertain ground truth and evaluation metrics. We develop a semi-automated pipeline for generating ground truth (GT) and validating evaluation metrics. This pipeline utilizes and implements an LLM-based self-consistency and human verification strategy to produce accurate GT by leveraging collected prompts, predefined task types, and aggregate functions (metrics). Furthermore, we propose an innovative Task-Function-Code (TFC) framework to assess each code execution outcome based on precisely defined metrics and programmatic rules. Our experimental framework involves testing 6 API-based models, 8 open-source general models, and 9 open-source code generation models using the diverse set of prompts we have gathered. This approach aims to provide a more comprehensive and rigorous evaluation of LLMs in data science, revealing their strengths and weaknesses. Experimental results demonstrate that API-based models outperform open-sourced models on all metrics and Deepseek-Coder-33B-Instruct achieves the highest score among open-sourced models. We release all code and data at https://github.com/THUDM/DataSciBench.
中文: 本文提出了DataSciBench,这是一个通过半自动化流程和任务-函数-代码框架来全面评估大语言模型在数据科学领域能力的新型基准测试,实验表明基于API的模型在所有指标上均优于开源模型。
English: This paper introduces DataSciBench, a novel benchmark designed to comprehensively assess Large Language Models' capabilities in data science through a semi-automated pipeline and a Task-Function-Code framework, revealing that API-based models consistently outperform open-source alternatives.

Authors:Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Yancheng Yuan, Dacheng Tao
Title: Enhancing Input-Label Mapping in In-Context Learning with Contrastive Decoding
Abstract:
Large language models (LLMs) excel at a range of tasks through in-context learning (ICL), where only a few task examples guide their predictions. However, prior research highlights that LLMs often overlook input-label mapping information in ICL, relying more on their pre-trained knowledge. To address this issue, we introduce In-Context Contrastive Decoding (ICCD), a novel method that emphasizes input-label mapping by contrasting the output distributions between positive and negative in-context examples. Experiments on 7 natural language understanding (NLU) tasks show that our ICCD method brings consistent and significant improvements (up to +1.8 on average) across 6 different scales of LLMs without requiring additional training. Our approach is versatile, enhancing performance with various demonstration selection methods, demonstrating its broad applicability and effectiveness. The code and scripts are released at https://github.com/Romainpkq/CD_ICL.
中文摘要:提出的上下文对比解码(ICCD)方法通过对比分析强化输入标签映射,无需额外训练即可在多类自然语言理解任务中持续提升大语言模型的性能表现。
English Summary: The proposed In-Context Contrastive Decoding (ICCD) method improves large language models' performance by emphasizing input-label mapping through contrastive analysis, achieving consistent gains across multiple NLU tasks without additional training.
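
The contrastive step can be sketched in a few lines; the interpolation form and the alpha value below are illustrative assumptions rather than the paper's exact decoding rule:

```python
# Sketch of contrastive decoding over in-context examples: amplify what the
# positive-demonstration distribution knows that the negative one does not.
import torch

def contrastive_logits(logits_pos, logits_neg, alpha=1.0):
    """Both inputs: [vocab] next-token logits under positive / negative demos."""
    return (1 + alpha) * logits_pos - alpha * logits_neg

vocab = 32000
adjusted = contrastive_logits(torch.randn(vocab), torch.randn(vocab))
next_token = adjusted.argmax().item()
```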

Authors:Jusen Du, Weigao Sun, Disen Lan, Jiaxi Hu, Yu Cheng
Title: MoM: Linear Sequence Modeling with Mixture-of-Memories
Abstract:
Linear sequence modeling methods, such as linear attention, state space modeling, and linear RNNs, offer significant efficiency improvements by reducing the complexity of training and inference. However, these methods typically compress the entire input sequence into a single fixed-size memory state, which leads to suboptimal performance on recall-intensive downstream tasks. Drawing inspiration from neuroscience, particularly the brain's ability to maintain robust long-term memory while mitigating "memory interference", we introduce a novel architecture called Mixture-of-Memories (MoM). MoM utilizes multiple independent memory states, with a router network directing input tokens to specific memory states. This approach greatly enhances the overall memory capacity while minimizing memory interference. As a result, MoM performs exceptionally well on recall-intensive tasks, surpassing existing linear sequence modeling techniques. Despite incorporating multiple memory states, the computation of each memory state remains linear in complexity, allowing MoM to retain the linear-complexity advantage during training while operating with constant complexity during inference. Our experimental results show that MoM significantly outperforms current linear sequence models on downstream language tasks, particularly recall-intensive tasks, and even achieves performance comparable to Transformer models. The code is released at https://github.com/OpenSparseLLMs/MoM and is also released as a part of https://github.com/OpenSparseLLMs/Linear-MoE.
Chinese: Mixture-of-Memories (MoM) 架构通过采用带路由网络的多个独立记忆状态,显著提升了线性序列模型在记忆密集型任务上的性能,同时保持了线性训练复杂度和常数推理复杂度的优势。
English: The Mixture-of-Memories (MoM) architecture enhances linear sequence models by employing multiple independent memory states with a router network, significantly improving performance on recall-intensive tasks while maintaining linear training and constant inference complexity.
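
A toy version of the routing idea follows, using hard routing and simple additive memories for clarity; the paper's recurrent memory updates are more elaborate than this sketch:

```python
# Toy Mixture-of-Memories: a router sends each token's update to one of
# several independent memory states.
import torch
import torch.nn as nn

class ToyMoM(nn.Module):
    def __init__(self, d=64, n_mem=4):
        super().__init__()
        self.router = nn.Linear(d, n_mem)
        self.n_mem, self.d = n_mem, d

    def forward(self, x):                      # x: [seq_len, d]
        mem = x.new_zeros(self.n_mem, self.d)
        for t in range(x.size(0)):
            slot = self.router(x[t]).argmax()  # hard routing for clarity
            mem[slot] = mem[slot] + x[t]       # each memory updates independently
        return mem.mean(dim=0)                 # read-out: pool the memories

out = ToyMoM()(torch.randn(128, 64))
```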

Authors:Ruida Hu, Chao Peng, Xinchen Wang, Junjielong Xu, Cuiyun Gao
Title: Repo2Run: Automated Building Executable Environment for Code Repository at Scale
Abstract:
Scaling up executable code data is significant for improving language models' software engineering capability. The intricate nature of the process makes it labor-intensive, time-consuming and expert-knowledge-dependent to build a large number of executable code repositories, limiting the scalability of existing work based on running tests. The primary bottleneck lies in the automated building of test environments for different repositories, which is an essential yet underexplored task. To mitigate the gap, we introduce Repo2Run, the first LLM-based agent aiming at automating the building of executable test environments for any repositories at scale. Specifically, given a code repository, Repo2Run iteratively builds the Docker image, runs unit tests based on the build feedback, and synthesizes the Dockerfile until the entire pipeline executes successfully. The resulting Dockerfile can then be used to create Docker container environments for running code and tests. We created a benchmark containing 420 Python repositories with unit tests for evaluation. The results illustrate that Repo2Run achieves an 86.0% success rate, outperforming SWE-agent by 77.0%. The resources of Repo2Run are available at https://github.com/bytedance/Repo2Run.
中文: 扩展可执行代码数据对提升语言模型的软件工程能力至关重要,Repo2Run作为首个基于大语言模型的代理,能自动为代码仓库构建测试环境,成功率达到86.0%,远超现有方法。
English: Scaling executable code data is crucial for enhancing language models' software engineering capabilities, and Repo2Run, an LLM-based agent, automates the building of test environments for repositories, achieving an 86.0% success rate and significantly outperforming existing methods.

Authors:Tim Baumgärtner, Ted Briscoe, Iryna Gurevych
Title: PeerQA: A Scientific Question Answering Dataset from Peer Reviews
Abstract:
We present PeerQA, a real-world, scientific, document-level Question Answering (QA) dataset. PeerQA questions have been sourced from peer reviews, which contain questions that reviewers raised while thoroughly examining the scientific article. Answers have been annotated by the original authors of each paper. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP, as well as a subset of other scientific communities like Geoscience and Public Health. PeerQA supports three critical tasks for developing practical QA systems: Evidence retrieval, unanswerable question classification, and answer generation. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks. Our experiments and analyses reveal the need for decontextualization in document-level retrieval, where we find that even simple decontextualization approaches consistently improve retrieval performance across architectures. On answer generation, PeerQA serves as a challenging benchmark for long-context modeling, as the papers have an average size of 12k tokens. Our code and data are available at https://github.com/UKPLab/peerqa.
中文: PeerQA是一个基于同行评审问题的科学文档级问答数据集,包含作者标注的答案,支持证据检索、不可回答问题分类和答案生成三大任务,并揭示了去语境化对提升检索性能的关键作用。
English: PeerQA is a scientific document-level QA dataset derived from peer review questions with author-annotated answers, supporting evidence retrieval, unanswerable question classification, and answer generation tasks while demonstrating the importance of decontextualization for retrieval performance.

Authors:DongGeon Lee, Hwanjo Yu
Title: REFIND at SemEval-2025 Task 3: Retrieval-Augmented Factuality Hallucination Detection in Large Language Models
Abstract:
Hallucinations in large language model (LLM) outputs severely limit their reliability in knowledge-intensive tasks such as question answering. To address this challenge, we introduce REFIND (Retrieval-augmented Factuality hallucINation Detection), a novel framework that detects hallucinated spans within LLM outputs by directly leveraging retrieved documents. As part of the REFIND, we propose the Context Sensitivity Ratio (CSR), a novel metric that quantifies the sensitivity of LLM outputs to retrieved evidence. This innovative approach enables REFIND to efficiently and accurately detect hallucinations, setting it apart from existing methods. In the evaluation, REFIND demonstrated robustness across nine languages, including low-resource settings, and significantly outperformed baseline models, achieving superior IoU scores in identifying hallucinated spans. This work highlights the effectiveness of quantifying context sensitivity for hallucination detection, thereby paving the way for more reliable and trustworthy LLM applications across diverse languages. Our code is available at https://github.com/oneonlee/REFIND.
中文摘要:REFIND框架通过检索文档和创新的语境敏感度比率指标,能有效检测大语言模型输出中的幻觉内容,在多语言环境下表现出强大性能并显著优于基线模型。
English Summary: The REFIND framework effectively detects hallucinations in LLM outputs by using retrieved documents and a novel Context Sensitivity Ratio metric, demonstrating robust performance across multiple languages and outperforming baseline models.
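
One way to picture a context-sensitivity score: compare per-token probabilities with and without the retrieved evidence and flag tokens the evidence barely supports. The log-ratio form and the threshold below are illustrative approximations, not the paper's exact CSR definition:

```python
# Hedged sketch of a context-sensitivity signal for hallucination detection.
import math

def context_sensitivity(p_with_ctx, p_without_ctx, eps=1e-9):
    """Per-token log-ratio; values near zero or negative suggest the retrieved
    context does not support the token."""
    return [math.log((pw + eps) / (po + eps))
            for pw, po in zip(p_with_ctx, p_without_ctx)]

scores = context_sensitivity([0.9, 0.2, 0.7], [0.1, 0.25, 0.6])
flagged = [i for i, s in enumerate(scores) if s < 0.1]   # assumed threshold
print(scores, flagged)
```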

Authors:Guangwei Li, Yuansen Zhang, Yinggui Wang, Shoumeng Yan, Lei Wang, Tao Wei
Title: PRIV-QA: Privacy-Preserving Question Answering for Cloud Large Language Models
Abstract:
The rapid development of large language models (LLMs) is redefining the landscape of human-computer interaction, and their integration into various user-service applications is becoming increasingly prevalent. However, transmitting user data to cloud-based LLMs presents significant risks of data breaches and unauthorized access to personal identification information. In this paper, we propose a privacy preservation pipeline for protecting privacy and sensitive information during interactions between users and LLMs in practical LLM usage scenarios. We construct SensitiveQA, the first privacy open-ended question-answering dataset. It comprises 57k interactions in Chinese and English, encompassing a diverse range of user-sensitive information within the conversations. Our proposed solution employs a multi-stage strategy aimed at preemptively securing user information while simultaneously preserving the response quality of cloud-based LLMs. Experimental validation underscores our method's efficacy in balancing privacy protection with maintaining robust interaction quality. The code and dataset are available at https://github.com/ligw1998/PRIV-QA.
中文: 本文提出了一种隐私保护流程,能在用户与大型语言模型交互时有效防护敏感数据泄露,同时保持云端模型的应答质量,并通过新构建的多语言数据集SensitiveQA验证了其有效性。
English: This paper introduces a privacy preservation pipeline that effectively safeguards sensitive user data during interactions with cloud-based large language models while maintaining response quality, as validated through a newly constructed multilingual dataset called SensitiveQA.

Authors:Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Yang You, Guiming Xie, Xuejian Gong, Kunlong Zhou
Title: Train Small, Infer Large: Memory-Efficient LoRA Training for Large Language Models
Abstract:
Large Language Models (LLMs) have significantly advanced natural language processing with exceptional task generalization capabilities. Low-Rank Adaptation (LoRA) offers a cost-effective fine-tuning solution, freezing the original model parameters and training only lightweight, low-rank adapter matrices. However, the memory footprint of LoRA is largely dominated by the original model parameters. To mitigate this, we propose LoRAM, a memory-efficient LoRA training scheme founded on the intuition that many neurons in over-parameterized LLMs have low training utility but are essential for inference. LoRAM presents a unique twist: it trains on a pruned (small) model to obtain pruned low-rank matrices, which are then recovered and utilized with the original (large) model for inference. Additionally, minimal-cost continual pre-training, performed by the model publishers in advance, aligns the knowledge discrepancy between pruned and original models. Our extensive experiments demonstrate the efficacy of LoRAM across various pruning strategies and downstream tasks. For a model with 70 billion parameters, LoRAM enables training on a GPU with only 20G HBM, replacing an A100-80G GPU for LoRA training and 15 GPUs for full fine-tuning. Specifically, QLoRAM implemented by structured pruning combined with 4-bit quantization, for LLaMA-3.1-70B (LLaMA-2-70B), reduces the parameter storage cost that dominates the memory usage in low-rank matrix training by 15.81× (16.95×), while achieving dominant performance gains over both the original LLaMA-3.1-70B (LLaMA-2-70B) and LoRA-trained LLaMA-3.1-8B (LLaMA-2-13B). Code is available at https://github.com/junzhang-zj/LoRAM.
中文: 大语言模型得益于LoRA的高效微调,但其内存使用受限于原始参数,因此LoRAM提出在剪枝后的小模型上训练,通过恢复矩阵进行推理,以降低内存需求并保持性能。
English: Large Language Models benefit from LoRA's efficient fine-tuning, but its memory footprint is dominated by the original model parameters, so LoRAM trains on a pruned model and recovers the low-rank matrices for inference on the full model, reducing memory demands while maintaining performance.
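
The recover step is the distinctive move: low-rank factors trained against a pruned model are scattered back into the original model's dimensions. The index bookkeeping below is an illustrative guess at that mapping, not the released code:

```python
# Hedged sketch of recovering pruned LoRA factors to full-model dimensions.
import torch

def recover_lora(A_pruned, B_pruned, kept_in, kept_out, d_in, d_out):
    """A_pruned: [r, kept_in], B_pruned: [kept_out, r] trained on the small model.
    kept_in/kept_out: indices of the surviving input/output dims in the big model."""
    r = A_pruned.size(0)
    A = torch.zeros(r, d_in)
    B = torch.zeros(d_out, r)
    A[:, kept_in] = A_pruned      # pruned input dims contribute nothing
    B[kept_out, :] = B_pruned     # pruned output dims receive no update
    return A, B                   # full-model weight update is B @ A

A, B = recover_lora(torch.randn(8, 512), torch.randn(512, 8),
                    torch.arange(512), torch.arange(512), 1024, 1024)
```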

Authors:Jialin Ouyang
Title: TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation
Abstract:
Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident, yet unfounded, answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that systematically generates an unlimited number of unanswerable math word problems and their answerable counterparts by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under the zero-shot setting. Further analysis highlights that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems. The dataset generation code and sample data are available at https://github.com/j-bagel/treecut-math.
Chinese Summary: 大型语言模型在面对无解数学题时常会自信地给出错误答案,TreeCut数据集通过系统生成不可解问题,揭示了GPT-4o等模型在零样本条件下最高达64%的幻觉产生率。
English Summary: Large language models frequently provide confident but incorrect answers to unsolvable math problems, as demonstrated by the TreeCut dataset, which reveals hallucination rates up to 64% in models like GPT-4o under specific conditions.
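
The construction can be mimicked with a toy fact-dependency example: deleting any necessary fact yields an unanswerable variant. The problem schema below is invented for illustration, not taken from the dataset:

```python
# Illustrative take on the cut-a-necessary-condition idea behind TreeCut.
import random

facts = {
    "apple_price": "An apple costs $2.",
    "apples_bought": "Tom buys 5 apples.",    # necessary for the total
    "banana_price": "A banana costs $1.",     # irrelevant distractor
}
question = "How much does Tom spend on apples?"
needed = ["apple_price", "apples_bought"]

def make_problem(unanswerable: bool) -> str:
    kept = dict(facts)
    if unanswerable:
        del kept[random.choice(needed)]       # cut one necessary condition
    return " ".join(kept.values()) + " " + question

print(make_problem(unanswerable=True))
```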

Authors:Vishal Dey, Xiao Hu, Xia Ning
Title: GeLLMO: Generalizing Large Language Models for Multi-property Molecule Optimization
Abstract:
Despite recent advancements, most computational methods for molecule optimization are constrained to single- or double-property optimization tasks and suffer from poor scalability and generalizability to novel optimization tasks. Meanwhile, Large Language Models (LLMs) demonstrate remarkable out-of-domain generalizability to novel tasks. To demonstrate LLMs' potential for molecule optimization, we introduce MuMOInstruct, the first high-quality instruction-tuning dataset specifically focused on complex multi-property molecule optimization tasks. Leveraging MuMOInstruct, we develop GeLLMOs, a series of instruction-tuned LLMs for molecule optimization. Extensive evaluations across 5 in-domain and 5 out-of-domain tasks demonstrate that GeLLMOs consistently outperform state-of-the-art baselines. GeLLMOs also exhibit outstanding zero-shot generalization to unseen tasks, significantly outperforming powerful closed-source LLMs. Such strong generalizability demonstrates the tremendous potential of GeLLMOs as foundational models for molecule optimization, thereby tackling novel optimization tasks without resource-intensive retraining. MuMOInstruct, models, and code are accessible through https://github.com/ninglab/GeLLMO.
中文: 本研究推出了首个多属性分子优化指令数据集MuMOInstruct,并开发了GeLLMOs模型,该模型在多项任务中超越现有方法,展现出卓越的零样本泛化能力,为复杂分子优化提供了无需重复训练的高效解决方案。
English: This study introduces MuMOInstruct, a dataset for multi-property molecule optimization, and develops GeLLMOs, an instruction-tuned LLM that outperforms existing methods and demonstrates strong zero-shot generalization to novel tasks, offering a resource-efficient solution for complex optimization challenges.

Authors:Yiming Zeng, Wanhao Yu, Zexin Li, Tao Ren, Yu Ma, Jinghan Cao, Xiyan Chen, Tingting Yu
Title: Bridging the Editing Gap in LLMs: FineEdit for Precise and Targeted Text Modifications
Abstract:
Large Language Models (LLMs) have significantly advanced natural language processing, demonstrating strong capabilities in tasks such as text generation, summarization, and reasoning. Recently, their potential for automating precise text editing tasks across specialized domains, such as programming code, LaTeX, and structured database languages, has gained attention. However, current state-of-the-art LLMs still struggle with executing precise, instruction-driven edits, particularly when structural accuracy and strict adherence to domain conventions are required. To address these challenges, we introduce InstrEditBench, an automated benchmark dataset comprising over 30,000 structured editing tasks spanning diverse domains, including Wikipedia articles, LaTeX documents, source code, and database languages. Using this benchmark, we develop FineEdit, a specialized editing model explicitly trained for accurate, context-aware text modifications. Experimental evaluations demonstrate that FineEdit outperforms state-of-the-art models, achieving improvements of approximately 10% over Gemini models on single-turn edits, up to 30% over Llama-3.2-3B, and exceeding Mistral-7B-OpenOrca performance by over 40% on direct editing tasks. FineEdit also effectively generalizes to realistic multi-turn editing scenarios, highlighting its practical applicability. To facilitate further research and reproducibility, we release FineEdit at https://github.com/StuRinDQB/FineEdit and https://huggingface.co/datasets/YimingZeng/FineEdit_bench.
中文: 大语言模型在精确文本编辑方面存在局限,为此研发的FineEdit模型在多种编辑任务中显著超越现有模型,展现出卓越的性能和实用性。
English: Large Language Models face challenges in precise text editing, leading to the development of FineEdit, a specialized model that significantly outperforms existing models across various domains and editing tasks.

Authors:Shi Yu, Zhiyuan Liu, Chenyan Xiong
Title: Craw4LLM: Efficient Web Crawling for LLM Pretraining
Abstract:
Web crawl is a main source of large language models' (LLMs) pretraining data, but the majority of crawled web pages are discarded in pretraining due to low data quality. This paper presents Craw4LLM, an efficient web crawling method that explores the web graph based on the preference of LLM pretraining. Specifically, it leverages the influence of a webpage in LLM pretraining as the priority score of the web crawler's scheduler, replacing the standard graph connectivity based priority. Our experiments on a web graph containing 900 million webpages from a commercial search engine's index demonstrate the efficiency of Craw4LLM in obtaining high-quality pretraining data. With just 21% URLs crawled, LLMs pretrained on Craw4LLM data reach the same downstream performances of previous crawls, significantly reducing the crawling waste and alleviating the burdens on websites. Our code is publicly available at https://github.com/cxcscmu/Craw4LLM.
Chinese: 本文提出Craw4LLM方法,通过根据网页在LLM预训练中的影响力设定抓取优先级,仅抓取21%的网页即可达到同等模型性能,显著减少了无效抓取。
English: This paper introduces Craw4LLM, an efficient web crawling method that prioritizes webpages based on their influence in LLM pretraining, reducing wasted crawls and achieving comparable model performance with only 21% of URLs crawled.
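
The core change is the scheduler's priority function. A minimal best-first crawler sketch follows, where `influence` and `fetch_links` are hypothetical stand-ins for the pretraining-influence scorer and the page fetcher:

```python
# Sketch of influence-prioritized crawling: a best-first frontier ordered by
# an (assumed) pretraining-influence score instead of graph connectivity.
import heapq

def crawl(seeds, influence, fetch_links, budget=1000):
    frontier = [(-influence(u), u) for u in seeds]   # max-heap via negation
    heapq.heapify(frontier)
    seen, crawled = set(seeds), []
    while frontier and len(crawled) < budget:
        _, url = heapq.heappop(frontier)
        crawled.append(url)
        for nxt in fetch_links(url):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-influence(nxt), nxt))
    return crawled

# Toy usage on a fake two-page graph.
pages = {"a": ["b"], "b": []}
print(crawl(["a"], influence=lambda u: len(u), fetch_links=pages.get))
```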

Authors:Yunpeng Xiao, Youpeng Zhao, Kai Shu
Title: Understanding and Tackling Label Errors in Individual-Level Nature Language Understanding
Abstract:
Natural language understanding (NLU) is a task that enables machines to understand human language. Some tasks, such as stance detection and sentiment analysis, are closely related to individual subjective perspectives and are thus termed individual-level NLU. Previously, these tasks were often simplified to text-level NLU tasks, ignoring individual factors. This not only makes inference difficult and unexplainable but often results in a large number of label errors when creating datasets. To address the above limitations, we propose a new NLU annotation guideline based on individual-level factors. Specifically, we incorporate other posts by the same individual and then annotate individual subjective perspectives after considering all individual posts. We use this guideline to expand and re-annotate the stance detection and topic-based sentiment analysis datasets. We find that error rates in the samples were as high as 31.7% and 23.3%. We further use large language models to conduct experiments on the re-annotated datasets and find that the large language models perform well on both datasets after adding individual factors. Both GPT-4o and Llama3-70B can achieve an accuracy greater than 87% on the re-annotated datasets. We also verify the effectiveness of individual factors through ablation studies. We call on future researchers to add individual factors when creating such datasets. Our re-annotated dataset can be found at https://github.com/24yearsoldstudent/Individual-NLU.
中文: 本研究针对立场检测和主题情感分析等自然语言理解任务,提出了基于个体层面的标注新方法,发现原数据集的标注错误率高达31.7%和23.3%,并通过实验证明引入个体因素能显著提升大语言模型的准确率至87%以上。
English: This study introduces individual-level annotation guidelines for natural language understanding tasks like stance detection and sentiment analysis, revealing high error rates in existing datasets and demonstrating that incorporating individual factors significantly improves model performance.

Authors:Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
Title: MoBA: Mixture of Block Attention for Long-Context LLMs
Abstract:
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention, which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored. In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
中文摘要:提出的混合块注意力(MoBA)使大型语言模型能够自主决定注意力模式,在长上下文任务中表现优异,同时实现完整与稀疏注意力机制的高效切换。
English Summary: The proposed Mixture of Block Attention (MoBA) enables large language models to autonomously determine attention patterns, achieving superior performance on long-context tasks while efficiently switching between full and sparse attention mechanisms.
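
A toy block-sparse attention sketch in this spirit follows, where each KV block is gated by its mean key; this is a simplification of MoBA's gating, not the deployed kernel:

```python
# Toy block attention: score each KV block, attend only within top-k blocks.
import torch

def block_attention(q, K, V, block=64, k_blocks=2):
    n = K.size(0) // block
    Kb = K[: n * block].view(n, block, -1)
    Vb = V[: n * block].view(n, block, -1)
    gate = Kb.mean(dim=1) @ q                        # one score per block
    top = torch.topk(gate, k_blocks).indices
    Ks = Kb[top].reshape(-1, K.size(-1))             # concat selected blocks
    Vs = Vb[top].reshape(-1, V.size(-1))
    attn = torch.softmax(Ks @ q / K.size(-1) ** 0.5, dim=0)
    return attn @ Vs

out = block_attention(torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64))
```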

Authors:Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
Title: Rethinking Diverse Human Preference Learning through Principal Component Analysis
Abstract:
Understanding human preferences is crucial for improving foundation models and building personalized AI systems. However, preferences are inherently diverse and complex, making it difficult for traditional reward models to capture their full range. While fine-grained preference data can help, collecting it is expensive and hard to scale. In this paper, we introduce Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons without requiring fine-grained annotations. Our key insight is to represent human preferences as vectors and analyze them using Principal Component Analysis (PCA). By constructing a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference. These decomposed rewards can be flexibly combined to align with different user needs, offering an interpretable and scalable alternative to traditional reward models. We demonstrate that DRMs effectively extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training. Our results highlight DRMs as a powerful framework for personalized and interpretable LLM alignment. Our code is available at https://github.com/amandaluof/DRMs.
中文: 本文提出分解奖励模型(DRMs),通过向量表征和主成分分析从二元比较中提取多样化人类偏好,为个性化大语言模型对齐提供可解释且可扩展的替代方案。
English: This paper introduces Decomposed Reward Models (DRMs), a novel approach that extracts diverse human preferences from binary comparisons using vector representations and PCA, providing an interpretable and scalable alternative to traditional reward models for personalized LLM alignment.
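
The recipe reduces to PCA over preference differences. Below is a self-contained sketch on synthetic data; the embedding source and the per-user weighting scheme are assumptions of this example:

```python
# Sketch of the DRM idea: PCA over embedding differences between chosen and
# rejected responses yields orthogonal "preference directions" that can be
# recombined per user.
import numpy as np

rng = np.random.default_rng(0)
chosen, rejected = rng.normal(size=(500, 64)), rng.normal(size=(500, 64))
diffs = chosen - rejected                       # one vector per comparison

diffs -= diffs.mean(axis=0)
_, _, Vt = np.linalg.svd(diffs, full_matrices=False)
basis = Vt[:5]                                  # top-5 preference directions

def reward(embedding, weights):
    """Score a response embedding under a user-specific mix of directions."""
    return weights @ (basis @ embedding)

print(reward(rng.normal(size=64), np.array([1.0, 0.5, 0.0, 0.0, 0.0])))
```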

Authors:Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Yao Mu, Hongyuan Zhang, Wenqi Shao, Ping Luo
Title: Text2World: Benchmarking Large Language Models for Symbolic World Model Generation
Abstract:
Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions. Although LLMs have been extensively explored in the context of world modeling, prior studies encountered several challenges, including evaluation randomness, dependence on indirect metrics, and a limited domain scope. To address these limitations, we introduce a novel benchmark, Text2World, based on planning domain definition language (PDDL), featuring hundreds of diverse domains and employing multi-criteria, execution-based metrics for a more robust evaluation. We benchmark current LLMs using Text2World and find that reasoning models trained with large-scale reinforcement learning outperform others. However, even the best-performing model still demonstrates limited capabilities in world modeling. Building on these insights, we examine several promising strategies to enhance the world modeling capabilities of LLMs, including test-time scaling, agent training, and more. We hope that Text2World can serve as a crucial resource, laying the groundwork for future research in leveraging LLMs as world models. The project page is available at https://text-to-world.github.io/.
中文: 研究者提出基于PDDL的Text2World基准,通过多维度执行指标解决大语言模型构建世界模型时的评估缺陷,发现强化学习训练的推理模型表现最优但仍存局限,同时提出扩展策略并为后续研究奠定基础。
English: Researchers introduce Text2World, a PDDL-based benchmark addressing evaluation challenges in LLM-generated world models, revealing that reinforcement learning-trained reasoning models perform best but still have limitations, while proposing enhancement strategies and establishing a foundation for future research.

Authors:Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
Title: Robust Adaptation of Large Multimodal Models for Retrieval Augmented Hateful Meme Detection
Abstract:
Hateful memes have become a significant concern on the Internet, necessitating robust automated detection systems. While Large Multimodal Models (LMMs) have shown promise in hateful meme detection, they face notable challenges like sub-optimal performance and limited out-of-domain generalization capabilities. Recent studies further reveal the limitations of both supervised fine-tuning (SFT) and in-context learning when applied to LMMs in this setting. To address these issues, we propose a robust adaptation framework for hateful meme detection that enhances in-domain accuracy and cross-domain generalization while preserving the general vision-language capabilities of LMMs. Analysis reveals that our approach achieves improved robustness under adversarial attacks compared to SFT models. Experiments on six meme classification datasets show that our approach achieves state-of-the-art performance, outperforming larger agentic systems. Moreover, our method generates higher-quality rationales for explaining hateful content compared to standard SFT, enhancing model interpretability. Code is available at https://github.com/JingbiaoMei/RGCL.
Chinese: 本文提出了一种鲁棒适应框架,通过提升领域内准确性和跨领域泛化能力来增强仇恨表情包检测,同时保留大型多模态模型的视觉语言能力,实现了最先进的性能并通过高质量解释提升了模型可解释性。
English: This paper introduces a robust adaptation framework that enhances hateful meme detection by improving in-domain accuracy and cross-domain generalization while preserving LMMs' vision-language capabilities, achieving state-of-the-art performance and superior interpretability through high-quality rationales.

Authors:Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
Title: HPSS: Heuristic Prompting Strategy Search for LLM Evaluators
Abstract:
Since the adoption of large language models (LLMs) for text evaluation has become increasingly prevalent in the field of natural language processing (NLP), a series of existing works attempt to optimize the prompts for LLM evaluators to improve their alignment with human judgment. However, their efforts are limited to optimizing individual factors of evaluation prompts, such as evaluation criteria or output formats, neglecting the combinatorial impact of multiple factors, which leads to insufficient optimization of the evaluation pipeline. Nevertheless, identifying well-behaved prompting strategies for adjusting multiple factors requires extensive enumeration. To this end, we comprehensively integrate 8 key factors for evaluation prompts and propose a novel automatic prompting strategy optimization method called Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm, HPSS conducts an iterative search to find well-behaved prompting strategies for LLM evaluators. A heuristic function is employed to guide the search process, enhancing the performance of our algorithm. Extensive experiments across four evaluation tasks demonstrate the effectiveness of HPSS, consistently outperforming both human-designed evaluation prompts and existing automatic prompt optimization methods. Our code is available at https://github.com/thu-coai/HPSS.
中文: 研究者提出HPSS方法,通过整合多个提示因素自动优化大语言模型评估策略,有效提升其与人类判断的一致性,在多项评估任务中表现优于现有方法。
English: Researchers propose HPSS, an automatic prompting strategy optimization method that integrates multiple factors to enhance LLM evaluators' alignment with human judgment, outperforming existing approaches across various tasks.
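
To make the search loop concrete, here is a minimal genetic-algorithm-flavored sketch over combinations of prompt factors. The factor names, options, and toy fitness function are invented for illustration; the paper integrates 8 factors and guides the search with a learned heuristic.

```python
import random

# Hypothetical prompt factors; the paper integrates 8 such factors, and
# the concrete options here are illustrative stand-ins.
FACTORS = {
    "criteria":      ["holistic", "per-aspect"],
    "scale":         ["1-5", "1-10"],
    "output_format": ["score_only", "rationale_then_score"],
    "examples":      [0, 1, 3],
}

def mutate(strategy):
    """Resample one factor, keeping the rest of the strategy fixed."""
    s = dict(strategy)
    f = random.choice(list(FACTORS))
    s[f] = random.choice(FACTORS[f])
    return s

def heuristic_search(fitness, rounds=50, pop=8):
    """Genetic-algorithm-flavored search: keep a small population of
    strategies, mutate the best ones, and let a fitness function
    (e.g., agreement with human judgments on a dev set) guide it."""
    population = [{f: random.choice(v) for f, v in FACTORS.items()}
                  for _ in range(pop)]
    for _ in range(rounds):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: pop // 2]
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop - len(parents))]
    return max(population, key=fitness)

# Toy fitness standing in for correlation with human judgments.
best = heuristic_search(lambda s: hash(tuple(sorted(s.items()))) % 100)
print(best)
```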

Authors:Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
Title: Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
Abstract:
Sailor2 is a family of cutting-edge multilingual language models for South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to support 13 SEA languages while retaining proficiency in Chinese and English. Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA languages. We also deliver a comprehensive cookbook on how to develop the multilingual model in an efficient manner, including five key aspects: data curation, pre-training, post-training, model customization and evaluation. We hope that Sailor2 model (Apache 2.0 license) will drive language development in the SEA region, and Sailor2 cookbook will inspire researchers to build more inclusive LLMs for other under-served languages.
中文: Sailor2是面向东南亚语言的先进多语言模型系列,在性能上可与GPT-4o媲美,并附带完整开发指南以促进包容性语言AI发展。
English: Sailor2 is a family of advanced multilingual models for Southeast Asian languages, achieving competitive performance against GPT-4o and including a comprehensive development guide to foster inclusive language AI.

Authors:Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
Title: A Survey of Text Classification Under Class Distribution Shift
Abstract:
The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e., the distribution of the test data changes over time, which hinders the application of conventional ML models. One domain where the distribution shift naturally occurs is text classification, since people always find new topics to discuss. To this end, we survey research articles studying open-set text classification and related tasks. We divide the methods in this area based on the constraints that define the kind of distribution shift and the corresponding problem formulation, i.e., learning with the Universum, zero-shot learning, and open-set learning. We next discuss the predominant mitigation approaches for each problem setup. Finally, we identify several future work directions, aiming to push the boundaries beyond the state of the art. Interestingly, we find that continual learning can solve many of the issues caused by the shifting class distribution. We maintain a list of relevant papers at https://github.com/Eduard6421/Open-Set-Survey.
中文: 本综述探讨了针对测试数据分布变化而设计的开放集文本分类方法,按问题约束和应对策略对方法进行分类,强调持续学习作为关键解决方案,并指出了未来的研究方向。
English: This survey examines open-set text classification methods that address distribution shifts in test data, categorizing approaches by problem constraints and mitigation strategies while highlighting continual learning as a key solution and identifying future research directions.

Authors:Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov
Title: Trust Me, I'm Wrong: LLMs Hallucinate with Certainty Despite Knowing the Answer
Abstract:
Prior work on large language model (LLM) hallucinations has associated them with model uncertainty or inaccurate knowledge. In this work, we define and investigate a distinct type of hallucination, where a model can consistently answer a question correctly, but a seemingly trivial perturbation, which can happen in real-world settings, causes it to produce a hallucinated response with high certainty. This phenomenon, which we dub CHOKE (Certain Hallucinations Overriding Known Evidence), is particularly concerning in high-stakes domains such as medicine or law, where model certainty is often used as a proxy for reliability. We show that CHOKE examples are consistent across prompts, occur in different models and datasets, and are fundamentally distinct from other hallucinations. This difference leads existing mitigation methods to perform worse on CHOKE examples than on general hallucinations. Finally, we introduce a probing-based mitigation that outperforms existing methods on CHOKE hallucinations. These findings reveal an overlooked aspect of hallucinations, emphasizing the need to understand their origins and improve mitigation strategies to enhance LLM safety. The code is available at https://github.com/technion-cs-nlp/Trust_me_Im_wrong.
中文摘要:本研究提出CHOKE现象,即大型语言模型在轻微输入扰动下会覆盖正确知识产生自信但错误的幻觉响应,这种在高风险领域尤为严重的新型幻觉与常规幻觉存在本质区别,且现有缓解方法对其效果有限。
English Summary: This study identifies CHOKE, a distinct type of LLM hallucination where minor input perturbations cause models to override correct knowledge with confident but wrong responses, particularly problematic in high-stakes domains and resistant to current mitigation methods.
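
A minimal sketch in the spirit of the probing-based mitigation: train a linear probe on hidden states to flag likely CHOKE cases and abstain when it fires. The layer choice, random stand-in features, and threshold are assumptions, not the paper's recipe.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a linear probe on hidden states to flag generations that are
# likely CHOKE-style hallucinations, then abstain or fall back when
# the probe fires. Hidden states here are random stand-ins.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 256))   # hidden states from a probed layer
y_train = rng.integers(0, 2, size=500)  # 1 = hallucinated despite certainty

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def should_abstain(hidden_state, threshold=0.8):
    """Flag an answer when the probe's hallucination probability is high."""
    p = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    return p > threshold

print(should_abstain(rng.normal(size=256)))
```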

Authors:Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
Title: Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text
Abstract:
Masked language modeling has become a widely adopted unsupervised technique to pre-train large language models (LLMs). However, the process of selecting tokens for masking is random, and the percentage of masked tokens is typically fixed for the entire training process. In this paper, we propose to adjust the masking ratio and to decide which tokens to mask based on a novel task-informed anti-curriculum learning scheme. First, we harness task-specific knowledge about useful and harmful tokens in order to determine which tokens to mask. Second, we propose a cyclic decaying masking ratio, which corresponds to an anti-curriculum schedule (from hard to easy). We exemplify our novel task-informed anti-curriculum by masking (TIACBM) approach across three diverse downstream tasks: sentiment analysis, text classification by topic, and authorship attribution. Our findings suggest that TIACBM enhances the ability of the model to focus on key task-relevant features, contributing to statistically significant performance gains across tasks. We release our code at https://github.com/JarcaAndrei/TIACBM.
中文摘要:本文提出了一种基于任务信息的反课程掩码方法,通过动态调整掩码比例和选择策略,使模型能聚焦于任务关键特征,在多项自然语言处理任务中实现了显著性能提升。
English Summary: The paper introduces a task-informed anti-curriculum masking approach that dynamically adjusts masking ratios and token selection, significantly improving model performance across multiple NLP tasks by focusing on task-relevant features.
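
The two ingredients can be sketched directly: a cyclic masking-ratio schedule whose envelope decays from hard to easy, and a selection rule that masks the most task-useful tokens first. The constants and usefulness scores below are illustrative assumptions, not the paper's settings.

```python
import math

def cyclic_decaying_ratio(step, total_steps, cycles=4, hi=0.4, lo=0.1):
    """Anti-curriculum masking schedule sketch: the ratio oscillates
    cyclically while its envelope decays from hard (many masks) to
    easy (few masks) over training. Constants are illustrative."""
    decay = 1.0 - step / total_steps                     # linear envelope
    phase = 0.5 * (1 + math.cos(2 * math.pi * cycles * step / total_steps))
    return lo + (hi - lo) * decay * phase

def task_informed_mask(tokens, usefulness, ratio):
    """Mask the tokens rated most useful for the downstream task first,
    so the model must reconstruct task-relevant features."""
    n_mask = max(1, int(len(tokens) * ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: -usefulness[i])
    to_mask = set(ranked[:n_mask])
    return [("[MASK]" if i in to_mask else t) for i, t in enumerate(tokens)]

# Hypothetical usage: usefulness could come from TF-IDF or task lexicons.
toks = "the movie was surprisingly wonderful despite the plot".split()
scores = [0.1, 0.2, 0.1, 0.6, 0.9, 0.4, 0.1, 0.5]
print(task_informed_mask(toks, scores, cyclic_decaying_ratio(0, 1000)))
```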

Authors:Lakshmi Nair, Ian Trase, Mark Kim
Title: Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options
Abstract:
We present a novel reasoning approach called Flow-of-Options (FoO), designed to address intrinsic biases in Large Language Models (LLMs). Flow-of-Options enables LLMs to systematically explore a diverse range of possibilities in their reasoning, as demonstrated by an FoO-based agentic framework developed for autonomously solving Machine Learning (ML) tasks. FoO enforces diversity in LLM solutions through compressed and interpretable task representations, resulting in improvements of 38.2% - 69.2% on standard data science tasks, and 37.4% - 47.9% on therapeutic chemistry tasks, as compared to state-of-the-art baselines. With an overall operation cost under $1 per task, our framework is well-suited for cost-sensitive applications. Going beyond tabular classification and regression, we show the broader applicability of our FoO-based agentic system to tasks such as reinforcement learning and image generation. Our code is open-sourced at: https://github.com/flagshippioneering/Flow-of-Options.
中文摘要:Flow-of-Options(FoO)是一种新颖的推理方法,通过系统探索多样化解决方案来减少大语言模型的内在偏差,在数据科学任务上实现38.2%-69.2%、在治疗化学任务上实现37.4%-47.9%的性能提升,且单任务成本低于1美元。
English Summary: Flow-of-Options (FoO) is a novel reasoning approach that mitigates LLM biases by systematically exploring diverse solution possibilities, achieving performance improvements of 38.2%-69.2% on data science tasks and 37.4%-47.9% on therapeutic chemistry tasks while costing under $1 per task.

Authors:Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
Title: Soundwave: Less is More for Speech-Text Alignment in LLMs
Abstract:
Existing end-to-end speech large language models (LLMs) usually rely on large-scale annotated data for training, while data-efficient training has not been discussed in depth. We focus on two fundamental problems between speech and text: the representation space gap and sequence length inconsistency. We propose Soundwave, which utilizes an efficient training strategy and a novel architecture to address these issues. Results show that Soundwave outperforms the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks, using only one-fiftieth of the training data. Further analysis shows that Soundwave still retains its intelligence during conversation. The project is available at https://github.com/FreedomIntelligence/Soundwave.
中文: Soundwave是一种数据高效的语音大语言模型,通过新颖架构解决了语音与文本的差异,仅用五十分之一的训练数据就超越了Qwen2-Audio,同时保持了对话智能。
English: Soundwave is a data-efficient speech LLM that bridges the speech-text gap with a novel architecture, outperforming Qwen2-Audio using only 1/50th of the training data while maintaining conversational intelligence.

Authors:Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
Title: S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Abstract:
Recent studies have demonstrated the effectiveness of LLM test-time scaling. However, existing approaches to incentivize LLMs' deep thinking abilities generally require large-scale data or significant training efforts. Meanwhile, it remains unclear how to improve the thinking abilities of less powerful base models. In this work, we introduce S²R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Specifically, we first initialize LLMs with iterative self-verification and self-correction behaviors through supervised fine-tuning on carefully curated data. The self-verification and self-correction skills are then further strengthened by both outcome-level and process-level reinforcement learning, with minimized resource requirements, enabling the model to adaptively refine its reasoning process during inference. Our results demonstrate that, with only 3.1k self-verifying and self-correcting behavior initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data. Extensive experiments and analysis based on three base models across both in-domain and out-of-domain benchmarks validate the effectiveness of S²R. Our code and data are available at https://github.com/NineAbyss/S2R.
中文: 本文提出S²R框架,通过教导模型在推理过程中自我验证与自我修正来增强大语言模型的推理能力,仅需少量训练数据即可显著提升准确率。
English: This paper introduces S²R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference, achieving significant accuracy improvements with minimal training data.
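
Schematically, the behavior the framework instills looks like the following inference loop; `llm` is a hypothetical completion callable and the prompts are illustrative, not the paper's templates.

```python
# A sketch of the self-verify / self-correct loop that S²R instills via
# SFT and RL; all prompt wording here is an illustrative assumption.
def solve_with_self_verification(llm, question, max_rounds=3):
    answer = llm(f"Solve step by step:\n{question}")
    for _ in range(max_rounds):
        verdict = llm(
            f"Question: {question}\nProposed solution: {answer}\n"
            "Verify each step. Reply 'CORRECT' or point out the first error."
        )
        if verdict.strip().startswith("CORRECT"):
            return answer
        answer = llm(
            f"Question: {question}\nPrevious attempt: {answer}\n"
            f"Reviewer feedback: {verdict}\nProduce a corrected solution."
        )
    return answer  # best effort after the correction budget is spent

# Usage with any chat-completion wrapper, e.g.:
# final = solve_with_self_verification(my_model, "What is 17 * 24?")
```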

Authors:Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
Title: R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs
Abstract:
Recent studies have combined Large Language Models (LLMs) with Knowledge Graphs (KGs) to enhance reasoning, improving inference accuracy without additional training while mitigating hallucination. However, existing frameworks still suffer two practical drawbacks: they must be re-tuned whenever the KG or reasoning task changes, and they depend on a single, high-capacity LLM for reliable (i.e., trustworthy) reasoning. To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that separates reasoning into two roles: an Operator (a low-capacity LLM) that gathers evidence and a Supervisor (a high-capacity LLM) that makes final judgments. This design is cost-efficient for LLM inference while still maintaining strong reasoning accuracy. Additionally, R2-KG employs an Abstention mechanism, generating answers only when sufficient evidence is collected from KG, which significantly enhances reliability. Experiments across five diverse benchmarks show that R2-KG consistently outperforms baselines in both accuracy and reliability, regardless of the inherent capability of LLMs used as the Operator. Further experiments reveal that the single-agent version of R2-KG, equipped with a strict self-consistency strategy, achieves significantly higher-than-baseline reliability with reduced inference cost but increased abstention rate in complex KGs. Our findings establish R2-KG as a flexible and cost-effective solution for KG-based reasoning, reducing reliance on high-capacity LLMs while ensuring trustworthy inference. The code is available at https://github.com/ekrxjwh2009/R2-KG/.
中文摘要:近期研究将大语言模型与知识图谱结合以提升推理准确性并减少幻觉,但存在适应性差和依赖高容量模型的问题,为此提出R2-KG双代理框架,通过角色分工和弃权机制实现高效可靠的推理。
English Summary: Recent research integrates LLMs with Knowledge Graphs to boost reasoning accuracy and reduce hallucinations, yet faces issues with adaptability and reliance on high-capacity models, prompting the development of R2-KG, a dual-agent framework that enhances cost-efficiency and reliability through role separation and an abstention mechanism.
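
The Operator/Supervisor split with abstention can be sketched as below; `operator_llm`, `supervisor_llm`, and `kg.lookup` are hypothetical interfaces rather than the released API.

```python
# Schematic of the dual-agent split: a low-capacity Operator gathers KG
# evidence; a high-capacity Supervisor judges it, abstaining when the
# evidence is insufficient. All interfaces are illustrative assumptions.
def answer_with_abstention(operator_llm, supervisor_llm, kg, question,
                           max_hops=5):
    evidence = []
    for _ in range(max_hops):
        query = operator_llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Name the next KG relation or entity to look up, or say DONE."
        )
        if query.strip() == "DONE":
            break
        evidence.extend(kg.lookup(query))
    verdict = supervisor_llm(
        f"Question: {question}\nEvidence: {evidence}\n"
        "Answer only if the evidence is sufficient; otherwise say ABSTAIN."
    )
    return None if "ABSTAIN" in verdict else verdict
```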

Authors:Shengxiang Gao, Jey Han Lau, Jianzhong Qi
Title: Beyond Seen Data: Improving KBQA Generalization Through Schema-Guided Logical Form Generation
Abstract:
Knowledge base question answering (KBQA) aims to answer user questions in natural language using rich human knowledge stored in large KBs. As current KBQA methods struggle with unseen knowledge base elements at test time, we introduce SG-KBQA: a novel model that injects schema contexts into entity retrieval and logical form generation to tackle this issue. It uses the richer semantics and awareness of the knowledge base structure provided by schema contexts to enhance generalizability. We show that SG-KBQA achieves strong generalizability, outperforming state-of-the-art models on two commonly used benchmark datasets across a variety of test settings. Our source code is available at https://github.com/gaosx2000/SG_KBQA.
中文:SG-KBQA是一种新颖模型,通过将知识库模式上下文融入实体检索和逻辑形式生成,有效提升了知识库问答的泛化能力,在多种测试设置下超越了现有最优方法。
English: SG-KBQA is a novel model that enhances generalizability in knowledge base question answering by incorporating schema contexts into entity retrieval and logical form generation, outperforming state-of-the-art methods on benchmark datasets.

Authors:Yuanfan Li, Zhaohan Zhang, Chengzhengxu Li, Chao Shen, Xiaoming Liu
Title: Iron Sharpens Iron: Defending Against Attacks in Machine-Generated Text Detection with Adversarial Training
Abstract:
Machine-generated Text (MGT) detection is crucial for regulating and attributing online texts. While the existing MGT detectors achieve strong performance, they remain vulnerable to simple perturbations and adversarial attacks. To build an effective defense against malicious perturbations, we view MGT detection from a threat modeling perspective, that is, analyzing the model's vulnerability from an adversary's point of view and exploring effective mitigations. To this end, we introduce an adversarial framework for training a robust MGT detector, named GREedy Adversary PromoTed DefendER (GREATER). GREATER consists of two key components: an adversary, GREATER-A, and a detector, GREATER-D. GREATER-D learns to defend against the adversarial attack from GREATER-A and generalizes the defense to other attacks. GREATER-A identifies and perturbs the critical tokens in embedding space, along with greedy search and pruning to generate stealthy and disruptive adversarial examples. Besides, we update GREATER-A and GREATER-D synchronously, encouraging GREATER-D to generalize its defense to different attacks and varying attack intensities. Our experimental results across 10 text perturbation strategies and 6 adversarial attacks show that our GREATER-D reduces the Attack Success Rate (ASR) by 0.67% compared with SOTA defense methods while our GREATER-A is demonstrated to be more effective and efficient than SOTA attack approaches. Code and dataset are available at https://github.com/Liyuuuu111/GREATER.
中文摘要:本研究提出GREATER对抗训练框架,通过同步优化检测器防御能力和对抗样本生成策略,显著提升了机器生成文本检测的鲁棒性,在多种攻击场景下均优于现有方法。
English Summary: The study introduces GREATER, an adversarial training framework that enhances machine-generated text detection by simultaneously strengthening the detector against attacks and refining adversarial examples to improve robustness across various perturbation strategies.
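
A rough sketch of the adversary's move, under the assumption that token influence is ranked by loss-gradient magnitude and that the perturbation is a single signed-gradient step; the real GREATER-A additionally uses greedy search and pruning.

```python
import torch
import torch.nn.functional as F

def greedy_embedding_attack(model, embeds, label, k=5, eps=0.05):
    """Sketch of the GREATER-A idea: rank tokens by the gradient of the
    detector loss w.r.t. their embeddings, then perturb the k most
    influential ones. `model` maps a (1, seq, dim) embedding batch to
    (1, num_classes) logits; every name here is an illustrative assumption."""
    embeds = embeds.clone().detach().requires_grad_(True)
    logits = model(embeds.unsqueeze(0))                 # (1, num_classes)
    loss = F.cross_entropy(logits, label.view(1))
    loss.backward()
    scores = embeds.grad.norm(dim=-1)                   # (seq,) influence
    critical = scores.topk(min(k, len(scores))).indices
    with torch.no_grad():
        adv = embeds.clone()
        adv[critical] += eps * embeds.grad[critical].sign()  # ascend loss
    return adv.detach()

# Usage: adv = greedy_embedding_attack(detector, token_embeds, torch.tensor(1))
```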

Authors:Lu Yang, Jiajia Li, En Ci, Lefei Zhang, Zuchao Li, Ping Wang
Title: Label Drop for Multi-Aspect Relation Modeling in Universal Information Extraction
Abstract:
Universal Information Extraction (UIE) has garnered significant attention due to its ability to address model explosion problems effectively. Extractive UIE can achieve strong performance using a relatively small model, making it widely adopted. Extractive UIEs generally rely on task instructions for different tasks, including single-target instructions and multiple-target instructions. Single-target instruction UIE enables the extraction of only one type of relation at a time, limiting its ability to model correlations between relations and thus restricting its capability to extract complex relations. While multiple-target instruction UIE allows for the extraction of multiple relations simultaneously, the inclusion of irrelevant relations introduces decision complexity and impacts extraction accuracy. Therefore, for multi-relation extraction, we propose LDNet, which incorporates multi-aspect relation modeling and a label drop mechanism. By assigning different relations to different levels for understanding and decision-making, we reduce decision confusion. Additionally, the label drop mechanism effectively mitigates the impact of irrelevant relations. Experiments show that LDNet outperforms or achieves competitive performance with state-of-the-art systems on 9 tasks, 33 datasets, in both single-modal and multi-modal, few-shot and zero-shot settings. Code is available at https://github.com/Lu-Yang666/LDNet.
中文: 提出的LDNet通过多角度关系建模和标签丢弃机制,有效减少决策混淆并降低无关关系影响,在多种任务和场景下展现出优于或媲美先进系统的性能。
English: LDNet is proposed to enhance multi-relation extraction by employing multi-aspect relation modeling and a label drop mechanism, which reduces decision confusion and mitigates irrelevant relation impacts, demonstrating superior or competitive performance across diverse tasks and settings.

Authors:Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li
Title: G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation
Abstract:
Explainable recommendation has demonstrated significant advantages in informing users about the logic behind recommendations, thereby increasing system transparency, effectiveness, and trustworthiness. To provide personalized and interpretable explanations, existing works often combine the generation capabilities of large language models (LLMs) with collaborative filtering (CF) information. CF information extracted from the user-item interaction graph captures the user behaviors and preferences, which is crucial for providing informative explanations. However, due to the complexity of graph structure, effectively extracting the CF information from graphs still remains a challenge. Moreover, existing methods often struggle with the integration of extracted CF information with LLMs due to its implicit representation and the modality gap between graph structures and natural language explanations. To address these challenges, we propose G-Refer, a framework using graph retrieval-augmented large language models (LLMs) for explainable recommendation. Specifically, we first employ a hybrid graph retrieval mechanism to retrieve explicit CF signals from both structural and semantic perspectives. The retrieved CF information is explicitly formulated as human-understandable text by the proposed graph translation and accounts for the explanations generated by LLMs. To bridge the modality gap, we introduce knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of LLMs to process and utilize the retrieved CF information to generate explanations. Extensive experiments show that G-Refer achieves superior performance compared with existing methods in both explainability and stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer.
中文: G-Refer框架通过混合图检索机制提取显式协同过滤信号,并借助知识剪枝和微调技术将其与大语言模型融合,在可解释性和稳定性方面实现了优越性能。
English: The G-Refer framework enhances explainable recommendations by using a hybrid graph retrieval mechanism to extract explicit collaborative filtering signals and integrating them with large language models through knowledge pruning and fine-tuning, achieving superior performance in explainability and stability.

Authors:Kaiyang Wan, Honglin Mu, Rui Hao, Haoran Luo, Tianle Gu, Xiuying Chen
Title: A Cognitive Writing Perspective for Constrained Long-Form Text Generation
Abstract:
Like humans, Large Language Models (LLMs) struggle to generate high-quality long-form text that adheres to strict requirements in a single pass. This challenge is unsurprising, as successful human writing, according to the Cognitive Writing Theory, is a complex cognitive process involving iterative planning, translating, reviewing, and monitoring. Motivated by these cognitive principles, we aim to equip LLMs with human-like cognitive writing capabilities through CogWriter, a novel training-free framework that transforms LLM constrained long-form text generation into a systematic cognitive writing paradigm. Our framework consists of two key modules: (1) a Planning Agent that performs hierarchical planning to decompose the task, and (2) multiple Generation Agents that execute these plans in parallel. The system maintains quality via continuous monitoring and reviewing mechanisms, which evaluate outputs against specified requirements and trigger necessary revisions. CogWriter demonstrates exceptional performance on LongGenBench, a benchmark for complex constrained long-form text generation. Even when using Qwen-2.5-14B as its backbone, CogWriter surpasses GPT-4o by 22% in complex instruction completion accuracy while reliably generating texts exceeding 10,000 words. We hope this cognitive science-inspired approach provides a paradigm for LLM writing advancements. Code is available at https://github.com/KaiyangWan/CogWriter.
中文: CogWriter提出了一种无需训练的框架,通过模拟人类认知写作过程,结合分层规划、并行生成和持续监控,使大语言模型在复杂长文本生成中表现出色,其准确性和文本长度均大幅超越GPT-4o。
English: CogWriter introduces a training-free framework that mimics human cognitive writing processes, enabling LLMs to excel in complex long-form text generation by integrating hierarchical planning, parallel generation, and continuous monitoring, significantly outperforming GPT-4o in accuracy and length.

Authors:Weikai Lu, Hao Peng, Huiping Zhuang, Cen Chen, Ziqian Zeng
Title: SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
Abstract:
Multimodal Large Language Models (MLLMs) have serious security vulnerabilities. While safety alignment using multimodal datasets consisting of text and data of additional modalities can effectively enhance MLLMs' security, it is costly to construct these datasets. Existing low-resource security alignment methods, including textual alignment, have been found to struggle with the security risks posed by additional modalities. To address this, we propose Synthetic Embedding augmented safety Alignment (SEA), which optimizes embeddings of additional modalities through gradient updates to expand textual datasets. This enables multimodal safety alignment training even when only textual data is available. Extensive experiments on image, video, and audio-based MLLMs demonstrate that SEA can synthesize a high-quality embedding on a single RTX3090 GPU within 24 seconds. SEA significantly improves the security of MLLMs when faced with threats from additional modalities. To assess the security risks introduced by video and audio, we also introduce a new benchmark called VA-SafetyBench. High attack success rates across multiple MLLMs validate its challenge. Our code and data will be available at https://github.com/ZeroNLP/SEA.
Chinese: 提出的合成嵌入增强安全对齐(SEA)方法通过仅使用文本数据优化多模态嵌入,有效提升多模态大语言模型的安全性,能以较低计算成本应对跨模态威胁。
English: The proposed Synthetic Embedding augmented safety Alignment (SEA) method enhances multimodal large language models' security by optimizing embeddings for additional modalities using only textual data, effectively countering cross-modal threats while being computationally efficient.
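
The core move can be sketched as direct gradient optimization of a synthetic modality embedding against a loss the MLLM would compute on a real multimodal example; `mllm_loss` and the toy objective below are placeholders, not the paper's implementation.

```python
import torch

def synthesize_modality_embedding(mllm_loss, dim, steps=200, lr=0.1):
    """Sketch of SEA's core move: with no image/video/audio data available,
    directly optimize a synthetic modality embedding by gradient descent so
    that, paired with the alignment text, it yields the training signal a
    real multimodal example would. `mllm_loss` is a hypothetical callable
    mapping an embedding to a scalar loss."""
    emb = torch.randn(dim, requires_grad=True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = mllm_loss(emb)       # e.g., LM loss of the target safe reply
        loss.backward()
        opt.step()
    return emb.detach()

# Toy stand-in objective: pull the embedding toward a target vector.
target = torch.ones(16)
emb = synthesize_modality_embedding(lambda e: ((e - target) ** 2).mean(), 16)
print(emb.round())
```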

Authors:Liangying Shao, Yanfu Yan, Denys Poshyvanyk, Jinsong Su
Title: UniGenCoder: Merging Seq2Seq and Seq2Tree Paradigms for Unified Code Generation
Abstract:
Deep learning-based code generation has completely transformed the way developers write programs today. Existing approaches to code generation have focused either on the Sequence-to-Sequence paradigm, which generates target code as a sequence of tokens, or the Sequence-to-Tree paradigm, which outputs code as a sequence of actions. While these two paradigms are intuitively complementary, their combination has not been previously explored. By comparing the code generated under these two paradigms, we find that integrating them holds significant potential. In this paper, we propose UniGenCoder for code-related generation tasks, which consists of a shared encoder, a shared decoder with a minimal set of additional parameters to unify two paradigms, and a selector that dynamically chooses optimal paradigm for each instance. Also, during the model training, we first perform the multi-task learning and distillation strategies to facilitate knowledge transfer between two paradigms, and then leverage contrastive learning to train the selector. Experimental results on the text-to-code and code-to-code generation tasks demonstrate the effectiveness of our proposed model. We release our code at https://github.com/DeepLearnXMU/UniGenCoder.
Chinese: 本文提出UniGenCoder模型,通过共享编码器和解码器结合动态选择器,首次统一了序列到序列与序列到树两种代码生成范式,并采用多任务学习和对比学习策略,在文本到代码和代码到代码任务中验证了其有效性。
English: This paper introduces UniGenCoder, a novel model that unifies the Sequence-to-Sequence and Sequence-to-Tree paradigms for code generation, employing multi-task learning, distillation, and contrastive learning to enhance performance across text-to-code and code-to-code tasks.

Authors:Xiaoqian Liu, Ke Wang, Yongbin Li, Yuchuan Wu, Wentao Ma, Aobo Kong, Fei Huang, Jianbin Jiao, Junge Zhang
Title: EPO: Explicit Policy Optimization for Strategic Reasoning in LLMs via Reinforcement Learning
Abstract:
Large Language Models (LLMs) have shown impressive reasoning capabilities in well-defined problems with clear solutions, such as mathematics and coding. However, they still struggle with complex real-world scenarios like business negotiations, which require strategic reasoning: an ability to navigate dynamic environments and align long-term goals amidst uncertainty. Existing methods for strategic reasoning face challenges in adaptability, scalability, and transferring strategies to new contexts. To address these issues, we propose explicit policy optimization (EPO) for strategic reasoning, featuring an LLM that provides strategies in open-ended action space and can be plugged into arbitrary LLM agents to motivate goal-directed behavior. To improve adaptability and policy transferability, we train the strategic reasoning model via multi-turn reinforcement learning (RL), utilizing process rewards and iterative self-play. Experiments across social and physical domains demonstrate EPO's ability to achieve long-term goal alignment through enhanced strategic reasoning, achieving state-of-the-art performance on social dialogue and web navigation tasks. Our findings reveal various collaborative reasoning mechanisms emergent in EPO and its effectiveness in generating novel strategies, underscoring its potential for strategic reasoning in real-world applications. Code and data are available at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/EPO.
中文摘要:提出的显式策略优化(EPO)方法通过多轮强化学习增强大语言模型的战略推理能力,使其在社交对话和网页导航等复杂现实场景中展现出卓越的适应性和顶尖性能。
English Summary: The proposed Explicit Policy Optimization (EPO) method enhances LLMs' strategic reasoning through multi-turn reinforcement learning, enabling superior adaptability and state-of-the-art performance in complex real-world scenarios like social dialogue and web navigation.

Authors:Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken
Title: EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking
Abstract:
As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e., determining whether two programs produce identical outputs for all possible inputs. Unlike prior code generation benchmarks, this task directly tests a model's ability to reason about program semantics. EquiBench consists of 2400 program pairs across four languages and six categories. These pairs are generated through program analysis, compiler scheduling, and superoptimization, ensuring high-confidence labels, nontrivial difficulty, and full automation. We evaluate 19 state-of-the-art LLMs and find that in the most challenging categories, the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline. Further analysis reveals that models often rely on syntactic similarity rather than exhibiting robust reasoning about program semantics, highlighting current limitations. Our code and dataset are publicly available at https://github.com/Anjiang-Wei/equibench
中文: EquiBench作为通过等价性检查评估大语言模型程序语义理解能力的新基准,揭示了模型常依赖语法相似性而非深层语义推理的局限性。
English: EquiBench is a novel benchmark that assesses large language models' understanding of program semantics through equivalence checking, revealing their limited reasoning capabilities as they often rely on syntactic cues rather than deep semantic analysis.

Authors:Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Title: Multi-Attribute Steering of Language Models via Targeted Intervention
Abstract:
Inference-time intervention (ITI) has emerged as a promising method for steering large language model (LLM) behavior in a particular direction (e.g., improving helpfulness) by intervening on token representations without costly updates to the LLM's parameters. However, existing ITI approaches fail to scale to multi-attribute settings with conflicts, such as enhancing helpfulness while also reducing toxicity. To address this, we introduce Multi-Attribute Targeted Steering (MAT-Steer), a novel steering framework designed for selective token-level intervention across multiple attributes. MAT-Steer learns steering vectors using an alignment objective that shifts the model's internal representations of undesirable outputs closer to those of desirable ones while enforcing sparsity and orthogonality among vectors for different attributes, thereby reducing inter-attribute conflicts. We evaluate MAT-Steer in two distinct settings: (i) on question answering (QA) tasks where we balance attributes like truthfulness, bias, and toxicity; (ii) on generative tasks where we simultaneously improve attributes like helpfulness, correctness, and coherence. MAT-Steer outperforms existing ITI and parameter-efficient fine-tuning approaches across both task types (e.g., 3% average accuracy gain across QA tasks and 55.82% win rate against the best ITI baseline).
中文: MAT-Steer是一种新颖的引导框架,通过稀疏正交的引导向量对多属性进行选择性干预以减少冲突,在问答和生成任务中均优于现有方法。
English: MAT-Steer is a novel framework that enables selective intervention on multiple attributes in large language models by learning sparse, orthogonal steering vectors to reduce conflicts, outperforming existing methods in both QA and generative tasks.
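
The sparsity and orthogonality regularizers described above can be written down directly; the coefficients and the plain L1/Gram-matrix formulation are illustrative assumptions about the alignment objective, not the paper's exact loss.

```python
import torch

def mat_steer_penalties(steering, lam_sparse=0.01, lam_ortho=0.1):
    """Sketch of the regularizers described for MAT-Steer: an L1 term
    encourages sparse steering vectors, and an off-diagonal Gram penalty
    pushes vectors for different attributes toward orthogonality, reducing
    inter-attribute conflicts. `steering` is (n_attributes, hidden_dim);
    coefficients are illustrative."""
    sparsity = steering.abs().mean()
    gram = steering @ steering.T                   # pairwise inner products
    off_diag = gram - torch.diag(torch.diag(gram))
    orthogonality = (off_diag ** 2).mean()
    return lam_sparse * sparsity + lam_ortho * orthogonality

vectors = torch.randn(3, 768, requires_grad=True)  # e.g., truth, bias, toxicity
print(mat_steer_penalties(vectors))
```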

Authors:Norman Mu, Jonathan Lu, Michael Lavery, David Wagner
Title: A Closer Look at System Prompt Robustness
Abstract:
System prompts have emerged as a critical control surface for specifying the behavior of LLMs in chat and agent settings. Developers depend on system prompts to specify important context, output format, personalities, guardrails, content policies, and safety countermeasures, all of which require models to robustly adhere to the system prompt, especially when facing conflicting or adversarial user inputs. In practice, models often forget to consider relevant guardrails or fail to resolve conflicting demands between the system and the user. In this work, we study various methods for improving system prompt robustness by creating realistic new evaluation and fine-tuning datasets based on prompts collected from OpenAI's GPT Store and HuggingFace's HuggingChat. Our experiments assessing models with a panel of new and existing benchmarks show that performance can be considerably improved with realistic fine-tuning data, as well as inference-time interventions such as classifier-free guidance. Finally, we analyze the results of recently released reasoning models from OpenAI and DeepSeek, which show exciting but uneven improvements on the benchmarks we study. Overall, current techniques fall short of ensuring system prompt robustness and further study is warranted.
中文: 系统提示对于控制大语言模型行为至关重要,但模型在面临冲突输入时常常无法遵循,研究通过微调和推理干预提升了鲁棒性,虽取得进展但效果仍不稳定。
English: System prompts are essential for controlling LLM behavior, yet models often fail to adhere to them under conflicting inputs, prompting research into improved robustness through fine-tuning and inference interventions that show promising but inconsistent results.
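
One of the inference-time interventions mentioned, classifier-free guidance, has a compact form: amplify the logit shift the system prompt induces on the next-token distribution. The guidance weight below is illustrative.

```python
import torch

def cfg_logits(logits_with_system, logits_without_system, gamma=1.5):
    """Classifier-free guidance at decoding time: amplify the shift the
    system prompt induces on next-token logits. gamma > 1 strengthens
    adherence; the value here is illustrative."""
    return logits_without_system + gamma * (
        logits_with_system - logits_without_system
    )

# Toy usage with random logits over a 10-token vocabulary.
with_sys, without_sys = torch.randn(10), torch.randn(10)
next_token = torch.argmax(cfg_logits(with_sys, without_sys))
print(int(next_token))
```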

Authors:Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan
Title: MUDDFormer: Breaking Residual Bottlenecks in Transformers via Multiway Dynamic Dense Connections
Abstract:
We propose MUltiway Dynamic Dense (MUDD) connections, a simple yet effective method to address the limitations of residual connections and enhance cross-layer information flow in Transformers. Unlike existing dense connection approaches with static and shared connection weights, MUDD generates connection weights dynamically depending on hidden states at each sequence position and for each decoupled input stream (the query, key, value or residual) of a Transformer block. MUDD connections can be seamlessly integrated into any Transformer architecture to create MUDDFormer. Extensive experiments show that MUDDFormer significantly outperforms Transformers across various model architectures and scales in language modeling, achieving the performance of Transformers trained with 1.8X-2.4X compute. Notably, MUDDPythia-2.8B matches Pythia-6.9B in pretraining ppl and downstream tasks and even rivals Pythia-12B in five-shot settings, while adding only 0.23% parameters and 0.4% computation. Code in JAX and PyTorch and pre-trained models are available at https://github.com/Caiyun-AI/MUDDFormer.
中文:MUDD连接提出了一种动态方法,通过生成位置特定和输入流依赖的权重来增强Transformer中的跨层信息流,以极少的参数和计算量显著提升了语言建模性能。
English: MUDD connections introduce a dynamic method to enhance cross-layer information flow in Transformers by generating position-specific and input-stream-dependent weights, significantly improving performance in language modeling with minimal added parameters and computation.
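
A minimal sketch of one dynamic dense connection: mix all previous layers' hidden states with weights predicted per position from the current state. Using a single linear weight predictor, and one such module per stream, is a simplifying assumption rather than the released architecture.

```python
import torch
import torch.nn as nn

class DynamicDenseAggregate(nn.Module):
    """Minimal sketch of a multiway dynamic dense connection: instead of a
    plain residual, each position mixes the hidden states of all previous
    layers with weights predicted from the current hidden state. Separate
    instances would be used per stream (query/key/value/residual)."""
    def __init__(self, dim, n_layers):
        super().__init__()
        self.weight_proj = nn.Linear(dim, n_layers)

    def forward(self, histories, current):
        # histories: (n_layers, batch, seq, dim); current: (batch, seq, dim)
        w = torch.softmax(self.weight_proj(current), dim=-1)  # (b, s, L)
        # weighted sum over the layer axis, per position
        return torch.einsum("bsl,lbsd->bsd", w, histories)

agg = DynamicDenseAggregate(dim=64, n_layers=4)
hist = torch.randn(4, 2, 8, 64)
out = agg(hist, hist[-1])
print(out.shape)  # torch.Size([2, 8, 64])
```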

Authors:Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu
Title: Idiosyncrasies in Large Language Models
Abstract:
In this work, we unveil and study idiosyncrasies in Large Language Models (LLMs) -- unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generates the text. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals that these idiosyncrasies are rooted in word-level distributions. These patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the broader implications of our findings, including training on synthetic data, inferring model similarity, and robust evaluation of LLMs. Code is available at https://github.com/locuslab/llm-idiosyncrasies.
中文: 本研究揭示了大型语言模型输出中的独特模式,能够准确识别生成来源,并发现这些特征在文本改写后依然存在且编码于语义内容中。
English: This study identifies unique patterns in Large Language Models' outputs that enable accurate source model classification, revealing these idiosyncrasies persist through text modifications and are embedded in semantic content.
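
The classification setup reduces to embedding texts and predicting the source model. The paper fine-tunes text embedding models; the frozen-embedding linear classifier and random stand-in vectors below are a simplified sketch of that setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Sketch of the classification task: embed LLM-generated texts and train
# a classifier to predict the source model. Random vectors stand in for
# real text embeddings, so accuracy here is at chance by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))          # stand-in text embeddings
y = rng.integers(0, 5, size=1000)         # 5 source LLMs

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.3f}")
```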

Authors:Sayantan Adak, Pauras Mangesh Meher, Paramita Das, Animesh Mukherjee
Title: REVERSUM: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives
Abstract:
Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on less-known entities often lags behind that of the well-known ones. This study proposes a novel approach to enhancing Wikipedia's B and C category biography articles by leveraging personal narratives such as autobiographies and biographies. By utilizing a multi-staged retrieval-augmented generation technique -- REVerSum -- we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVerSum-generated content outperforms the best performing baseline by 17% in terms of integrability to the original Wikipedia article and 28.5% in terms of informativeness. Code and data are available at: https://github.com/sayantan11995/wikipedia_enrichment
中文: 本研究提出REVerSum方法,通过整合个人叙事来丰富维基百科B类和C类人物传记条目,使其可整合性提升17%,信息量增加28.5%,显著优于现有基线。
English: This study introduces REVerSum, a retrieval-augmented generation method that enhances Wikipedia's B and C category biography articles by incorporating personal narratives, significantly improving their integrability by 17% and informativeness by 28.5% over baselines.

Authors:Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Title: SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Abstract:
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token decoding, which constrains reasoning within the discrete vocabulary space and may not always be optimal. While recent efforts explore continuous-space reasoning, they often require full-model fine-tuning and suffer from catastrophic forgetting, limiting their applicability to state-of-the-art LLMs that already perform well in zero-shot settings with a proper instruction. To address this challenge, we propose a novel approach for continuous-space reasoning that does not require modifying the LLM. Specifically, we employ a lightweight fixed assistant model to speculatively generate instance-specific soft thought tokens as the initial chain of thoughts, which are then mapped into the LLM's representation space via a trainable projection module. Experimental results on five reasoning benchmarks demonstrate that our method enhances LLM reasoning performance through supervised, parameter-efficient fine-tuning. Source code is available at https://github.com/xuyige/SoftCoT.
中文: 提出的SoftCoT方法通过轻量级助手生成连续的软思考标记,并将其映射到大型语言模型的表示空间中,无需修改模型即可通过高效微调提升推理性能。
English: The proposed SoftCoT method enhances large language models' reasoning by using a lightweight assistant to generate continuous soft thought tokens, which are then projected into the model's space for efficient fine-tuning without full-model modifications.
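
The trainable bridge can be sketched as a small projection module that maps assistant hidden states into the frozen LLM's embedding space and prepends them to the prompt embeddings; the MLP shape and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SoftThoughtProjector(nn.Module):
    """Sketch of SoftCoT's trainable bridge: map the assistant model's
    hidden states (soft thought tokens) into the frozen target LLM's
    embedding space, where they are prepended to the prompt embeddings.
    The two-layer MLP and the dimensions are illustrative assumptions."""
    def __init__(self, assistant_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(assistant_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, soft_thoughts, prompt_embeds):
        # soft_thoughts: (batch, n_thoughts, assistant_dim)
        projected = self.proj(soft_thoughts)          # (b, n, llm_dim)
        return torch.cat([projected, prompt_embeds], dim=1)

bridge = SoftThoughtProjector()
out = bridge(torch.randn(1, 4, 768), torch.randn(1, 20, 4096))
print(out.shape)  # torch.Size([1, 24, 4096])
```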

Authors:Wujiang Xu, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, Yongfeng Zhang
Title: A-MEM: Agentic Memory for LLM Agents
Abstract:
While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems' fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/A-mem, while the source code of the agentic memory system is available at https://github.com/WujiangXu/A-mem-sys.
中文: 本文提出了一种基于Zettelkasten方法的智能记忆系统,通过动态索引和链接构建互联知识网络,使LLM代理能够实现记忆的持续演进,在实验中展现出优于现有方法的性能。
English: This paper introduces an agentic memory system for LLM agents that dynamically organizes and interconnects memories using Zettelkasten principles, enabling continuous evolution and superior performance over existing methods.
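
A toy Zettelkasten-flavored sketch of the note/link/evolve cycle: each new note carries structured attributes, links to sufficiently similar existing notes, and updates older notes' attributes. Keyword-overlap similarity stands in for the LLM-driven analysis the paper describes.

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    text: str
    keywords: set
    tags: set = field(default_factory=set)
    links: list = field(default_factory=list)

class AgenticMemory:
    """Toy stand-in: Jaccard overlap of keywords replaces the LLM-driven
    similarity analysis; linking also refreshes older notes' tags, a
    minimal form of the paper's memory evolution."""
    def __init__(self, link_threshold=0.3):
        self.notes, self.link_threshold = [], link_threshold

    def add(self, text, keywords, tags):
        note = Note(text, set(keywords), set(tags))
        for old in self.notes:
            overlap = len(note.keywords & old.keywords)
            sim = overlap / max(1, len(note.keywords | old.keywords))
            if sim >= self.link_threshold:
                note.links.append(old)
                old.links.append(note)
                old.tags |= note.tags    # evolve the older note's attributes
        self.notes.append(note)
        return note

mem = AgenticMemory()
mem.add("Booked flight to NeurIPS", {"travel", "neurips"}, {"logistics"})
n = mem.add("NeurIPS talk draft done", {"neurips", "talk"}, {"writing"})
print(len(n.links))  # 1: linked to the travel note via 'neurips'
```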

Authors:Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
Title: APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs
Abstract:
While long-context inference is crucial for advancing large language model (LLM) applications, its prefill speed remains a significant bottleneck. Current approaches, including sequence parallelism strategies and compute reduction through approximate attention mechanisms, still fall short of delivering optimal inference efficiency. This hinders scaling the inputs to longer sequences and processing long-context queries in a timely manner. To address this, we introduce APB, an efficient long-context inference framework that leverages multi-host approximate attention to enhance prefill speed by reducing compute and enhancing parallelism simultaneously. APB introduces a communication mechanism for essential key-value pairs within a sequence parallelism framework, enabling a faster inference speed while maintaining task performance. We implement APB by incorporating a tailored FlashAttn kernel alongside optimized distribution strategies, supporting diverse models and parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively, without any observable task performance degradation. We provide the implementation and experiment code of APB in https://github.com/thunlp/APB.
中文: APB是一种高效的长上下文推理框架,通过多主机近似注意力和优化并行机制显著提升预填充速度,在保持任务性能的同时实现高达9.2倍的加速效果。
English: APB is an efficient long-context inference framework that accelerates prefill speed through multi-host approximate attention and optimized parallelism, achieving up to 9.2x faster inference without performance loss.

Authors:En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
Title: Unhackable Temporal Rewarding for Scalable Video MLLMs
Abstract:
In the pursuit of superior video-processing MLLMs, we have encountered a perplexing paradox: the "anti-scaling law", where more data and larger models lead to worse performance. This study unmasks the culprit: "temporal hacking", a phenomenon where models shortcut by fixating on select frames, missing the full video narrative. In this work, we systematically establish a comprehensive theory of temporal hacking, defining it from a reinforcement learning perspective, introducing the Temporal Perplexity (TPL) score to assess this misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework to mitigate the temporal hacking. Both theoretically and empirically, TPL proves to be a reliable indicator of temporal modeling quality, correlating strongly with frame activation patterns. Extensive experiments reveal that UTR not only counters temporal hacking but significantly elevates video comprehension capabilities. This work not only advances video-AI systems but also illuminates the critical importance of aligning proxy rewards with true objectives in MLLM development.
中文总结:本研究揭示了视频处理多模态大模型中导致性能退化的"时间黑客"现象,提出了时间困惑度指标和不可黑客时间奖励框架,不仅能有效应对该问题,还显著提升了视频理解能力。
English Summary: This study identifies "temporal hacking" as the cause of performance degradation in video-processing MLLMs and introduces the Temporal Perplexity metric and Unhackable Temporal Rewarding framework to effectively counteract this issue while enhancing video comprehension.

Authors:Heming Xia, Chak Tou Leong, Wenjie Wang, Yongqi Li, Wenjie Li
Title: TokenSkip: Controllable Chain-of-Thought Compression in LLMs
Abstract:
Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning capabilities of large language models (LLMs). Recent advancements, such as OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT sequences during inference could further boost LLM reasoning performance. However, due to the autoregressive nature of LLM decoding, longer CoT outputs lead to a linear increase in inference latency, adversely affecting user experience, particularly when the CoT exceeds 10,000 tokens. To address this limitation, we analyze the semantic importance of tokens within CoT outputs and reveal that their contributions to reasoning vary. Building on this insight, we propose TokenSkip, a simple yet effective approach that enables LLMs to selectively skip less important tokens, allowing for controllable CoT compression. Extensive experiments across various models and tasks demonstrate the effectiveness of TokenSkip in reducing CoT token usage while preserving strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct, TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less than a 0.4% performance drop. We release our code and checkpoints in https://github.com/hemingkx/TokenSkip.
中文: TokenSkip 是一种创新方法,通过选择性跳过推理链中重要性较低的标记来实现可控压缩,在保持各类大语言模型任务性能的同时显著降低推理延迟。
English: TokenSkip is an innovative method that selectively compresses Chain-of-Thought sequences by skipping less important tokens, significantly reducing inference latency while maintaining reasoning performance across various LLMs and tasks.
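
The compression step itself is simple once importance scores exist: keep the top fraction of CoT tokens in their original order. The scores below are hypothetical inputs; deriving them from semantic contribution is the paper's contribution.

```python
def compress_cot(tokens, importance, keep_ratio=0.6):
    """Sketch of TokenSkip-style compression: rank CoT tokens by an
    importance score and keep only the top fraction, preserving the
    original order. Scores here are hypothetical inputs."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    keep = set(sorted(range(len(tokens)),
                      key=lambda i: -importance[i])[:n_keep])
    return [t for i, t in enumerate(tokens) if i in keep]

cot = "so the total is 12 plus 30 which equals 42".split()
scores = [0.1, 0.2, 0.7, 0.6, 0.9, 0.4, 0.9, 0.2, 0.5, 0.9]
print(" ".join(compress_cot(cot, scores)))  # drops low-importance fillers
```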

Authors:Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
Title: Atom of Thoughts for Markov LLM Test-Time Scaling
Abstract:
Large Language Models (LLMs) achieve superior performance through training-time scaling, and test-time scaling further enhances their capabilities by conducting effective reasoning during inference. However, as the scale of reasoning increases, existing test-time scaling methods suffer from accumulated historical information, which not only wastes computational resources but also interferes with effective reasoning. To address this issue, we observe that complex reasoning can be achieved by solving a series of independent and self-contained subquestions. These subquestions are essentially atomic questions, exhibiting the memoryless property similar to Markov processes. Based on this observation, we propose Atom of Thoughts (AoT), where each state transition consists of decomposing the current question into a dependency-based directed acyclic graph and contracting its subquestions, forming a simplified question that maintains answer equivalence with the original problem. This answer preservation enables the iterative decomposition-contraction process to naturally form a meaningful Markov reasoning process. Furthermore, these atomic states can be seamlessly integrated into existing test-time scaling methods, enabling AoT to serve as a plug-in enhancement for improving reasoning capabilities. Experiments across six benchmarks demonstrate the effectiveness of AoT both as a standalone framework and a plug-in enhancement. Notably, on HotpotQA, when applied to gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and DeepSeek-R1 by 10.6%. The code is available at https://github.com/qixucen/atom.
Chinese: 提出的Atom of Thoughts (AoT)方法通过将复杂问题分解为原子子问题并压缩为简化形式,有效提升大语言模型的推理能力,在多个基准测试中作为独立框架和插件增强均表现出卓越性能。
English: The proposed Atom of Thoughts (AoT) method enhances reasoning in Large Language Models by decomposing complex questions into atomic subquestions and contracting them into simplified forms, achieving superior performance across benchmarks as both a standalone framework and plug-in enhancement.

Authors:Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Hongyu Zhou, Jianjian Sun, Brian Li, Chengting Feng, Changyi Wan, Hanpeng Hu, Jianchang Wu, Jiangjie Zhen, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Hongyuan Wang, Kang An, Wei Ji, Wen Li, Xuan Wen, Xiangwen Kong, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Junjing Guo, Jiashuai Liu, Jiahong Liu, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Liang Zhao, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingliang Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Ran Sun, Shuai Shuai, Shaoliang Pang, Shiliang Yang, Shuli Gao, Shanshan Yuan, Siqi Liu, Shihong Deng, Shilei Jiang, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wuxun Xie, Weipeng Ming, Wenqing He, Wen Sun, Xin Han, Xin Huang, Xiaomin Deng, Xiaojia Liu, Xin Wu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaoyu Wang, Yaqiang Shi, Yilei Wang, Yizhuang Zhou, Yinmin Zhong, Yang Zhang, Yaoben Wei, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuchu Luo, Yuanhao Ding, Yuting Yan, Yaqi Dai, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zhisheng Guan, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu
Title: Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction
Abstract:
Real-time speech interaction, serving as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as high costs in voice data collection, weakness in dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multi-modal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and RAP; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. Based on our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in terms of instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing the development of open-source multi-modal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.
中文: 本文提出首个生产就绪的开源方案Step-Audio,通过1300亿参数统一语音文本模型、生成式数据引擎、动态控制系统和增强认知架构,在人工评估中实现最优性能,显著提升开源多模态技术发展。
English: This paper introduces Step-Audio, a production-ready open-source solution featuring a unified 130B-parameter speech-text model, generative data engine, dynamic control system, and enhanced cognitive architecture that achieves state-of-the-art performance in human evaluations.

Authors:Xuefeng Li, Haoyang Zou, Pengfei Liu
Title: LIMR: Less is More for RL Scaling
Abstract:
In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing language models' reasoning capabilities? While recent advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. We demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find they significantly underperform at the 7B scale when applied through supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results fundamentally reshape our understanding of RL scaling in LLMs, demonstrating that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including implementation of LIM, training and evaluation code, curated datasets, and trained models at https://github.com/GAIR-NLP/LIMR.
中文: 本研究表明,通过"学习影响度量"方法进行策略性样本选择比单纯扩大数据规模更能提升语言模型的推理能力,仅用1,389个样本就超越了完整数据集的性能表现。
English: This study reveals that strategic sample selection through Learning Impact Measurement (LIM) is more crucial than data volume for enhancing language models' reasoning, achieving superior performance with only 1,389 samples compared to full datasets.
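The trajectory-alignment idea behind LIM can be illustrated with a short sketch. Below is a minimal, hypothetical NumPy version: each candidate sample's per-checkpoint reward curve is scored against the model's average learning curve, and only the best-aligned samples are kept. The cosine scoring, array shapes, and checkpoint count are illustrative assumptions; the exact LIM formula in the paper may differ.

```python
import numpy as np

def lim_scores(sample_rewards: np.ndarray) -> np.ndarray:
    """Score each training sample by how well its per-checkpoint reward
    curve aligns with the model's average learning trajectory.

    sample_rewards: (n_samples, n_checkpoints) rewards recorded for each
    sample at successive RL training checkpoints.
    Returns one alignment score per sample (higher = more aligned).
    """
    avg_trajectory = sample_rewards.mean(axis=0)          # overall learning curve
    # Cosine similarity between each sample's curve and the average curve.
    num = sample_rewards @ avg_trajectory
    denom = (np.linalg.norm(sample_rewards, axis=1)
             * np.linalg.norm(avg_trajectory) + 1e-8)
    return num / denom

# Keep only the most aligned samples for RL training (toy reward curves).
rng = np.random.default_rng(0)
rewards = rng.random((8523, 10))
scores = lim_scores(rewards)
selected = np.argsort(scores)[-1389:]                     # top-1,389 subset
print(f"selected {len(selected)} of {len(scores)} samples")
```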

Authors:Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
Title: Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration
Abstract:
Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks requiring real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and perform reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. DPT-Agent can effectively help LLMs convert correct slow thinking and reasoning into executable actions, thereby improving performance. To the best of our knowledge, DPT-Agent is the first language agent framework that achieves successful real-time simultaneous human-AI collaboration autonomously. Code of DPT-Agent can be found at https://github.com/sjtu-marl/DPT-Agent.
中文摘要:DPT-Agent是一种新型语言智能体框架,通过整合快速反应的系统1和基于推理的系统2,实现了自主的实时人机协作,克服了当前基于大语言模型的智能体在延迟和策略推断方面的局限。
English Summary: DPT-Agent is a novel language agent framework that integrates fast System 1 and reasoning-based System 2 processes to enable autonomous real-time human-AI collaboration, overcoming latency and strategy inference limitations of current LLM-based agents.

Authors:Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
Title: Bitnet.cpp: Efficient Edge Inference for Ternary LLMs
Abstract:
The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to facilitate sub-2-bits-per-weight, efficient and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference, both enabling high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we expand TL to element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper, offering a sophisticated solution for the efficient and practical deployment of edge LLMs.
中文摘要:Bitnet.cpp为三元大语言模型推出了优化的推理系统,采用创新的混合精度矩阵乘法库,在实现无损边缘部署的同时,相比全精度基线最高可提升6.25倍推理速度。
English Summary: Bitnet.cpp introduces an optimized inference system for ternary large language models, featuring a novel mixed-precision matrix multiplication library that achieves up to 6.25x speed improvement over full-precision baselines while enabling lossless edge deployment.
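To see why ternary weights make mpGEMM cheap, consider a minimal NumPy reference of the I2_S idea: with weights restricted to {-1, 0, +1} plus one scale, the inner product reduces to additions and subtractions followed by a single multiply. This is only an illustrative sketch; the actual Bitnet.cpp kernels are optimized lookup-table implementations in C++, and the absmean quantization below is an assumption for the demo.

```python
import numpy as np

def ternary_gemm(x, w_ternary, scale):
    """Reference mixed-precision GEMM for ternary weights (I2_S idea):
    the inner product needs only adds/subtracts, plus one final scale."""
    pos = x @ (w_ternary == 1).astype(x.dtype)    # contributions where w = +1
    neg = x @ (w_ternary == -1).astype(x.dtype)   # contributions where w = -1
    return scale * (pos - neg)

# Absmean-style quantization of a float weight matrix to ternary + scale.
rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8)).astype(np.float32)
scale = np.abs(w).mean()
w_t = np.clip(np.round(w / scale), -1, 1).astype(np.int8)

x = rng.standard_normal((4, 16)).astype(np.float32)
y = ternary_gemm(x, w_t, scale)
print(y.shape, np.abs(y - x @ (w_t * scale)).max())  # matches dequantized matmul
```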

Authors:Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu
Title: Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities
Abstract:
This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills specific functionality requirements based on a given flowchart, which visually represents the desired algorithm or process. Code-Vision comprises three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding abilities across basic programming, algorithmic, and mathematical problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision. Experimental results demonstrate that there is a large performance difference between proprietary and open-source models. On Hard problems, GPT-4o can achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further experiments reveal that Code-Vision can pose unique challenges compared to other multimodal reasoning benchmarks such as MMCode and MathVista. We also explore the reasons for the poor performance of the open-source models. All data and code are available at https://github.com/wanghanbinpanda/CodeVision.
中文: 本文介绍了Code-Vision基准,通过要求多模态大语言模型根据流程图生成功能程序来评估其逻辑理解和代码生成能力,实验显示专有模型(如GPT-4o)在复杂任务上显著优于开源模型,性能差距悬殊。
English: This paper presents Code-Vision, a benchmark for assessing MLLMs' logical reasoning and code generation by requiring them to produce functional programs from flowcharts, revealing a significant performance gap where proprietary models like GPT-4o vastly outperform open-source ones, especially on complex tasks.

Authors:Xuan Ren, Qi Chen, Lingqiao Liu
Title: Efficient Response Generation Strategy Selection for Fine-Tuning Large Language Models Through Self-Aligned Perplexity
Abstract:
Fine-tuning large language models (LLMs) typically relies on producing large sets of input-output pairs. Yet for a given question, there can be many valid outputs. In practice, these outputs are often derived by distilling knowledge from teacher models, and they can vary depending on the specific teacher model or prompting strategy employed. Recent findings show that how these training outputs are generated can significantly affect the performance of the fine-tuned model, raising an important question: how do we pick the best data generation method from among numerous possibilities? Rather than exhaustively training and evaluating on each candidate, this paper proposes a scalable approximate method that assesses a small subset of generated data to estimate its suitability for a specific target LLM. Our central idea is that effective outputs should be familiar to the target LLM. While previous work measures familiarity with perplexity, our empirical analyses and practical observations suggest that perplexity may be suboptimal for characterizing familiarity. To address this, we introduce self-aligned perplexity, a novel metric capturing how closely candidate outputs adhere to the target LLM's own style and reasoning patterns. In this way, we can identify the most effective generation strategy on a small sample, then apply it to produce the complete training set. We demonstrate that training on data generated by the chosen method yields significant improvements across diverse reasoning-focused benchmarks, particularly in cases where different candidate methods lead to highly divergent training outcomes. Our implementation is publicly available at https://github.com/XuanRen4470/SPPL.
中文摘要:本文提出一种可扩展方法,通过自对齐困惑度评估少量生成数据来优选微调大语言模型的最佳数据生成策略,在多项推理基准测试中显著提升模型性能。
English Summary: This paper introduces a scalable method using self-aligned perplexity to efficiently select the best data generation strategy for fine-tuning LLMs by evaluating small data samples, which significantly improves model performance across reasoning benchmarks.
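As a concrete illustration of the familiarity signal, here is a minimal sketch (using Hugging Face transformers, with GPT-2 as a stand-in target model) that computes the conditional perplexity of a candidate training output under the target LLM. The paper's self-aligned variant additionally maps the candidate into the model's own style before scoring; that step is omitted here, and the prompt and candidate strings are made up for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for the actual target LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def output_perplexity(prompt: str, output: str) -> float:
    """Perplexity of `output` under the target model, conditioned on `prompt`.
    Only the base perplexity computation is shown; the self-aligned variant
    would first restyle the candidate in the model's own words."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + output, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the output tokens
    loss = model(full_ids, labels=labels).loss
    return torch.exp(loss).item()

# Rank candidate generation strategies by familiarity on a small probe set.
candidates = {"teacher_A": "The answer is 4 because 2+2=4.",
              "teacher_B": "Adding two and two yields four."}
scores = {k: output_perplexity("Q: What is 2+2?\nA: ", v)
          for k, v in candidates.items()}
print(min(scores, key=scores.get), scores)  # lower perplexity = more familiar
```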

Authors:Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Title: Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation
Abstract:
The widespread deployment of Large Language Models (LLMs) is hindered by their high computational demands, making knowledge distillation (KD) crucial for developing compact smaller models. However, conventional KD methods suffer from a distribution mismatch between the teacher and student models, leading to poor distillation performance. For instance, the widely-used KL-based methods suffer from mode-averaging and mode-collapsing problems due to the mismatched probability distributions of the two models. Previous studies mainly address this issue via different distance calculations over the distributions of both models. Unfortunately, the distribution mismatch still exists in the early stage of distillation. Hence, to reduce the impact of distribution mismatch, we propose a simple yet efficient method, named Warmup-Distill, which aligns the distribution of the student to that of the teacher in advance of distillation. Specifically, we first detect the distribution of the student model in practical scenarios with its internal knowledge, and then modify the low-probability knowledge with the teacher as the checker. Consequently, Warmup-Distill aligns the student's internal knowledge with that of the teacher, which expands the student's distribution toward the teacher's and assists the student model in learning better in the subsequent distillation. Experiments on seven benchmarks demonstrate that Warmup-Distill provides a warmed-up student more suitable for distillation, which outperforms the vanilla student by at least +0.4 average score across all benchmarks. Notably, with the assistance of Warmup-Distill, distillation on the math task yields a further improvement of at most +1.9% accuracy.
中文摘要:大语言模型面临计算需求高的挑战,而提出的Warmup-Distill方法通过预先对齐师生模型的知识分布,有效解决了知识蒸馏中的分布不匹配问题,在多个基准测试中实现了性能提升。
English Summary: Large language models face computational challenges, but the proposed Warmup-Distill method effectively addresses distribution mismatch between teacher and student models by pre-aligning their knowledge, resulting in improved performance across multiple benchmarks.
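The warmup step can be sketched at the token level: where the student assigns low probability to its own sampled token, the teacher acts as a checker and substitutes its preferred token, and the revised sequences then warm up the student before standard distillation. The probability threshold and the argmax substitution rule below are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def warmup_revise(student_logits, teacher_logits, tokens, p_thresh=0.1):
    """Sketch of the warmup revision: where the student assigns low
    probability to its own sampled token, substitute the teacher's choice.
    student_logits, teacher_logits: (seq, vocab); tokens: (seq,) sample.
    Returns a revised token sequence for warming up the student."""
    student_p = torch.softmax(student_logits, dim=-1)
    tok_p = student_p.gather(1, tokens.unsqueeze(1)).squeeze(1)
    teacher_choice = teacher_logits.argmax(dim=-1)
    low_conf = tok_p < p_thresh        # positions the student is unsure about
    return torch.where(low_conf, teacher_choice, tokens)

# Toy usage with random logits standing in for real model outputs.
seq, vocab = 6, 50
s_logits, t_logits = torch.randn(seq, vocab), torch.randn(seq, vocab)
sample = torch.randint(0, vocab, (seq,))
print(warmup_revise(s_logits, t_logits, sample))
```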

Authors:Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather
Title: LLM Agents Making Agent Tools
Abstract:
Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers, hindering the applicability of LLM agents in domains demanding large numbers of highly specialised tools, like in life sciences and medicine. Motivated by the growing trend of scientific studies accompanied by public code repositories, we propose ToolMaker, an agentic framework that autonomously transforms papers with code into LLM-compatible tools. Given a GitHub URL and short task description, ToolMaker autonomously installs dependencies and generates code to perform the task, using a closed-loop self-correction mechanism for debugging. To evaluate our approach, we introduce a benchmark comprising 15 complex computational tasks spanning various domains with over 100 unit tests to assess correctness and robustness. Our method correctly implements 80% of the tasks, substantially outperforming current state-of-the-art software engineering agents. ToolMaker therefore is a step towards fully autonomous agent-based scientific workflows. Our code and benchmark are publicly available at https://github.com/KatherLab/ToolMaker.
中文: ToolMaker 是一个自主框架,能够将附带代码库的研究论文转化为大语言模型兼容的工具,使大语言模型无需人工干预即可创建专业软件组件,在复杂计算任务中显著优于现有智能体。
English: ToolMaker is an autonomous framework that converts research papers with code repositories into LLM-compatible tools, enabling large language models to create specialized software components without human intervention and significantly outperforming existing agents in complex computational tasks.

Authors:Guangya Yu, Yanhao Li, Zongying Jiang, Yuxiong Jin, Li Dai, Yupian Lin, Ruihui Hou, Weiyan Zhang, Yongqi Fan, Qi Ye, Jingping Liu, Tong Ruan
Title: CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
Abstract:
Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task, MQCIC, and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. We then propose the Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights to further improve performance on MQCIC. The dataset and code are available at https://github.com/YuY-2001/C-MQCIC.
中文摘要:本研究提出了一种基于临床事实的推理规则方法,在中文电子病历的医疗质控指标计算中优于思维链方法,并通过在20个大语言模型上的全面实验验证了其有效性。
English Summary: This study introduces a clinical fact-based inferential rule method that outperforms chain-of-thought approaches in medical quality control indicator calculations using Chinese electronic medical records, supported by comprehensive experiments on 20 large language models.

Authors:Yuncheng Hua, Lizhen Qu, Zhuang Li, Hao Xue, Flora D. Salim, Gholamreza Haffari
Title: RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars
Abstract:
Alignment tuning is crucial for ensuring large language models (LLMs) behave ethically and helpfully. Current alignment approaches require high-quality annotations and significant training resources. This paper proposes a low-cost, tuning-free method using in-context learning (ICL) to enhance LLM alignment. Through an analysis of high-quality ICL demos, we identified style as a key factor influencing LLM alignment capabilities and explicitly restyled ICL exemplars based on this stylistic framework. Additionally, we combined the restyled demos to achieve a balance between the two conflicting aspects of LLM alignment--factuality and safety. We packaged the restyled examples as prompts to trigger few-shot learning, improving LLM alignment. Compared to the best baseline approach (scores are out of a maximum of 5.00), our method achieves a maximum 0.10 increase on the Alpaca task (from 4.50 to 4.60), a 0.22 enhancement on the Just-eval benchmark (from 4.34 to 4.56), and a maximum improvement of 0.32 (from 3.53 to 3.85) on the MT-Bench dataset. We release the code and data at https://github.com/AnonymousCode-ComputerScience/RIDE.
中文: 本文提出了一种低成本、无需调优的方法,通过情境学习重构示例风格来平衡事实性与安全性,从而提升大语言模型的对齐效果,并在多个基准测试中取得显著提升。
English: This paper introduces a low-cost, tuning-free method using in-context learning to enhance LLM alignment by restyling exemplars to balance factuality and safety, achieving significant improvements across multiple benchmarks.

Authors:Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, Philip S. Yu
Title: Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?
Abstract:
The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.
中文摘要:研究表明,通过针对性改写或推理时中和的方法可以有效消除大语言模型中水印的放射性,在保持知识传递的同时破坏水印检测,凸显了开发更强水印防御机制的迫切需求。
English Summary: The study reveals that watermark radioactivity in LLMs can be effectively removed through targeted paraphrasing or inference-time neutralization, compromising detection while preserving knowledge transfer, highlighting the need for more robust watermarking defenses.

Authors:Xiaoyu Tan, Tianchu Yao, Chao Qu, Bin Li, Minghao Yang, Dakuan Lu, Haozhe Wang, Xihe Qiu, Wei Chu, Yinghui Xu, Yuan Qi
Title: AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification
Abstract:
The reasoning capabilities of advanced large language models (LLMs) like o1 have revolutionized artificial intelligence applications. Nevertheless, evaluating and optimizing complex reasoning processes remain significant challenges due to diverse policy distributions and the inherent limitations of human effort and accuracy. In this paper, we present AURORA, a novel automated framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification. The framework employs a two-phase approach: First, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of processes, ensuring robust assessments for reward learning. Second, it leverages practical reference answers for reverse verification, enhancing the model's ability to validate outputs and improving training accuracy. To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench, which evaluates reward predictions across full trajectories under diverse policy distributions with long Chain-of-Thought (CoT) outputs. Experimental results demonstrate that AURORA enhances process evaluation accuracy and improves PRMs' accuracy on diverse policy distributions and long-CoT responses. The project will be open-sourced at https://auroraprm.github.io/. The Universal-PRM-7B is available at https://huggingface.co/infly/Universal-PRM-7B.

Authors:Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang
Title: Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More
Abstract:
Vision tokens in multimodal large language models often incur a huge computational overhead due to their excessive length compared to the linguistic modality. Many recent methods aim to solve this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually results in worse performance than random token pruning and leads to incompatibility with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to a 1.99$\times$ and 2.99$\times$ speed-up in total time and the prefilling stage, respectively, with good compatibility with efficient attention operators. Our codes are available at https://github.com/ZichenWen1/DART.
中文: 本文提出DART方法,通过基于视觉令牌与关键令牌的重复性进行剪枝,无需训练即可实现高达88.9%的令牌削减和近3倍加速,同时保持模型性能。
English: This paper introduces DART, a training-free method that accelerates multimodal models by pruning vision tokens based on duplication with pivot tokens, achieving up to 88.9% token reduction and nearly 3x speed-up while maintaining performance.
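A minimal sketch of the duplication criterion: pick a few pivot tokens, measure each token's cosine similarity to its nearest pivot, and keep the tokens with the lowest duplication. The uniform pivot choice, token budget, and feature shapes below are illustrative; the paper's pivot-selection strategy may differ.

```python
import torch
import torch.nn.functional as F

def dart_prune(vision_tokens: torch.Tensor, n_pivots: int = 8, keep: int = 64):
    """Duplication-aware token reduction sketch.
    vision_tokens: (n_tokens, dim) features from the vision encoder.
    Retains the tokens least duplicated (lowest max cosine similarity)
    with respect to a small set of pivot tokens."""
    n = vision_tokens.shape[0]
    pivot_idx = torch.linspace(0, n - 1, n_pivots).long()  # simple pivot choice
    feats = F.normalize(vision_tokens, dim=-1)
    sim = feats @ feats[pivot_idx].T                       # (n, n_pivots) cosines
    duplication = sim.max(dim=-1).values                   # nearest-pivot similarity
    duplication[pivot_idx] = -1.0                          # always keep pivots
    keep_idx = duplication.topk(keep, largest=False).indices
    return vision_tokens[keep_idx.sort().values]           # preserve token order

tokens = torch.randn(576, 1024)        # e.g., a 24x24 ViT token grid
print(dart_prune(tokens).shape)        # kept subset of the vision tokens
```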

Authors:Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
Title: Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
Abstract:
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks, yet they often struggle with visual arithmetic, seemingly simple capabilities such as object counting or length comparison that are essential for complex tasks like chart understanding and geometric reasoning. In this work, we first investigate the root causes of this deficiency through a suite of probing tasks focusing on basic visual arithmetic. Our analysis reveals that while pre-trained vision encoders typically capture sufficient information, the text decoder often fails to decode it correctly for arithmetic reasoning. To address this, we propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development. CogAlign trains VLMs to recognize invariant properties under visual transformations. We demonstrate that this approach significantly improves the performance of three diverse VLMs on our proposed probing tasks. Furthermore, CogAlign enhances performance by an average of 4.6% on CHOCOLATE and 2.9% on MATH-VISION, outperforming or matching supervised fine-tuning methods while requiring 60% less training data. These results highlight the effectiveness and generalizability of CogAlign in improving fundamental visual arithmetic capabilities and their transfer to downstream tasks.
Chinese: 视觉语言模型常因文本解码器问题在视觉算术任务中表现不佳,但提出的CogAlign训练策略显著提升了其性能与泛化能力,同时减少了数据需求。
English: Vision Language Models often fail at visual arithmetic tasks due to text decoder limitations, but the proposed CogAlign training strategy significantly enhances their performance and generalizability with less data.

Authors:Haochen Li, Wanjin Feng, Xin Zhou, Zhiqi Shen
Title: GiFT: Gibbs Fine-Tuning for Code Generation
Abstract:
Training Large Language Models (LLMs) with synthetic data is a prevalent practice in code generation. A key approach is self-training, where LLMs are iteratively trained on self-generated correct code snippets. In this case, the self-generated codes are drawn from a conditional distribution, conditioned on a specific seed description. However, the seed description is not the only valid representation that aligns with its intended meaning. With all valid descriptions and codes forming a joint space, codes drawn from the conditional distribution would lead to an underrepresentation of the full description-code space. As such, we propose Gibbs Fine-Tuning (GiFT), a novel self-training method inspired by Gibbs sampling. GiFT allows self-generated data to be drawn from the marginal distribution of the joint space, thereby mitigating the biases inherent in conditional sampling. We provide a theoretical analysis demonstrating the potential benefits of fine-tuning LLMs with code derived from the marginal distribution. Furthermore, we propose a perplexity-based code selection method to mitigate the imbalanced long-tail distribution of the self-generated codes. Empirical evaluation of two LLMs across four datasets demonstrates that GiFT achieves superior performance, particularly on more challenging benchmarks. Source code is available at https://github.com/Alex-HaochenLi/GiFT.
中文摘要:本研究提出吉布斯微调(GiFT)方法,通过从描述-代码联合空间的边际分布中采样,解决了代码生成中的代表性不足问题,显著提升了大型语言模型在复杂基准测试中的表现。
English Summary: The study introduces Gibbs Fine-Tuning (GiFT), a self-training method that addresses the underrepresentation in code generation by sampling from the marginal distribution of the joint description-code space, enhancing LLM performance on challenging benchmarks.
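The Gibbs-style alternation at the heart of GiFT can be sketched in a few lines: alternately resample code given the current description and a fresh description given the code, so that collected pairs come (approximately) from the marginal of the joint description-code space rather than the conditional anchored at one seed. `llm_generate` and `passes_tests` are assumed helpers (a model call and a unit-test harness), and the prompts are illustrative.

```python
def gibbs_self_training_pairs(seed_description, llm_generate, passes_tests,
                              rounds=4):
    """Sketch of GiFT-style Gibbs sampling over the description-code space.
    Alternates code ~ p(code | desc) and desc ~ p(desc | code), collecting
    verified pairs for self-training along the way."""
    desc, pairs = seed_description, []
    for _ in range(rounds):
        code = llm_generate(f"Write a Python function for: {desc}")
        if passes_tests(code):               # keep only verified-correct code
            pairs.append((desc, code))
        desc = llm_generate(f"Describe what this code does:\n{code}")
    return pairs
```

The perplexity-based code selection from the paper would then filter `pairs` before fine-tuning; that step is omitted here.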

Authors:Junru Lu, Jiazheng Li, Guodong Shen, Lin Gui, Siyu An, Yulan He, Di Yin, Xing Sun
Title: RoleMRC: A Fine-Grained Composite Benchmark for Role-Playing and Instruction-Following
Abstract:
Role-playing is important for Large Language Models (LLMs) to follow diverse instructions while maintaining role identity and the role's pre-defined ability limits. Existing role-playing datasets mostly contribute to controlling role style and knowledge boundaries, but overlook role-playing in instruction-following scenarios. We introduce a fine-grained role-playing and instruction-following composite benchmark, named RoleMRC, including: (1) Multi-turn dialogues between ideal roles and humans, including free chats or discussions upon given passages; (2) Role-playing machine reading comprehension, involving response, refusal, and attempts according to passage answerability and role ability; (3) More complex scenarios with nested, multi-turn and prioritized instructions. The final RoleMRC features a 10.2k role profile meta-pool, 37.9k well-synthesized role-playing instructions, and 1.4k testing samples. We develop a pipeline to quantitatively evaluate the fine-grained role-playing and instruction-following capabilities of several mainstream LLMs, as well as models that are fine-tuned on our data. Moreover, cross-evaluation on external role-playing datasets confirms that fine-tuning on RoleMRC enhances instruction-following without compromising general role-playing and reasoning capabilities. We also probe the neural-level activation maps of different capabilities over post-tuned LLMs. Our RoleMRC, RoleMRC-mix, and code are available at: https://github.com/LuJunru/RoleMRC.
中文:RoleMRC是一个新颖的基准,通过多轮对话和复杂场景增强大语言模型的细粒度角色扮演和指令遵循能力,评估显示其在提升性能的同时不损害通用能力。
English: RoleMRC is a novel benchmark designed to enhance large language models' fine-grained role-playing and instruction-following abilities through multi-turn dialogues and complex scenarios, with evaluations showing improved performance without sacrificing general capabilities.

Authors:Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman
Title: Sparse Autoencoder Features for Classifications and Transferability
Abstract:
Sparse Autoencoders (SAEs) show promise for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.
中文摘要:稀疏自编码器能够从大语言模型中提取可解释特征,在安全关键任务中超越基线方法,并通过优化配置实现跨模型与跨任务的泛化能力。
English Summary: Sparse Autoencoders effectively extract interpretable features from Large Language Models, surpassing baseline methods in safety-critical tasks and enabling cross-model and cross-task generalization with optimized configurations.
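A minimal sketch of the downstream-classification recipe: pool sparse SAE activations over tokens, binarize them, and fit a linear probe. Everything below uses random toy data in place of real SAE codes extracted from an LLM layer; the shapes, the max-pooling choice, and the zero threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy stand-in: `sae_acts` would come from running a pretrained SAE over a
# chosen LLM layer's residual stream for each example (n, tokens, sae_width).
rng = np.random.default_rng(0)
sae_acts = np.maximum(rng.standard_normal((400, 32, 256)), 0)  # ReLU-sparse codes
labels = rng.integers(0, 2, 400)

pooled = sae_acts.max(axis=1)            # pooling strategy: max over tokens
binary = (pooled > 0).astype(np.float32) # binarize activations

clf = LogisticRegression(max_iter=1000).fit(binary[:300], labels[:300])
pred = clf.predict(binary[300:])
print("macro F1:", f1_score(labels[300:], pred, average="macro"))
```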

Authors:Shaina Raza, Ashmal Vayani, Aditya Jain, Aravind Narayanan, Vahid Reza Khazaie, Syed Raza Bashir, Elham Dolatabadi, Gias Uddin, Christos Emmanouilidis, Rizwan Qureshi, Mubarak Shah
Title: VLDBench Evaluating Multimodal Disinformation with Regulatory Alignment
Abstract:
Detecting disinformation that blends manipulated text and images has become increasingly challenging, as AI tools make synthetic content easy to generate and disseminate. While most existing AI safety benchmarks focus on single modality misinformation (i.e., false content shared without intent to deceive), intentional multimodal disinformation, such as propaganda or conspiracy theories that imitate credible news, remains largely unaddressed. We introduce the Vision-Language Disinformation Detection Benchmark (VLDBench), the first large-scale resource supporting both unimodal (text-only) and multimodal (text + image) disinformation detection. VLDBench comprises approximately 62,000 labeled text-image pairs across 13 categories, curated from 58 news outlets. Using a semi-automated pipeline followed by expert review, 22 domain experts invested over 500 hours to produce high-quality annotations with substantial inter-annotator agreement. Evaluations of state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs) on VLDBench show that incorporating visual cues improves detection accuracy by 5 to 35 percentage points over text-only models. VLDBench provides data and code for evaluation, fine-tuning, and robustness testing to support disinformation analysis. Developed in alignment with AI governance frameworks (e.g., the MIT AI Risk Repository), VLDBench offers a principled foundation for advancing trustworthy disinformation detection in multimodal media. Project: https://vectorinstitute.github.io/VLDBench/ Dataset: https://huggingface.co/datasets/vector-institute/VLDBench Code: https://github.com/VectorInstitute/VLDBench
中文: VLDBench是首个支持单模态和多模态虚假信息检测的大规模基准,研究表明结合视觉线索比纯文本模型将检测准确率提高了5-35%。
English: VLDBench is the first large-scale benchmark for detecting both unimodal and multimodal disinformation, showing that incorporating visual cues improves detection accuracy by 5-35% over text-only models.

Authors:Rongwu Xu, Xiaojian Li, Shuo Chen, Wei Xu
Title: Nuclear Deployed: Analyzing Catastrophic Risks in Decision-making of Autonomous LLM Agents
Abstract:
Large language models (LLMs) are evolving into autonomous decision-makers, raising concerns about catastrophic risks in high-stakes scenarios, particularly in Chemical, Biological, Radiological and Nuclear (CBRN) domains. Based on the insight that such risks can originate from trade-offs between the agent's Helpfulness, Harmlessness, and Honesty (HHH) goals, we build a novel three-stage evaluation framework, which is carefully constructed to effectively and naturally expose such risks. We conduct 14,400 agentic simulations across 12 advanced LLMs, with extensive experiments and analysis. Results reveal that LLM agents can autonomously engage in catastrophic behaviors and deception, without being deliberately induced. Furthermore, stronger reasoning abilities often increase, rather than mitigate, these risks. We also show that these agents can violate instructions and superior commands. On the whole, we empirically prove the existence of catastrophic risks in autonomous LLM agents. We release our code to foster further research.
中文摘要:大型语言模型在作为自主智能体时存在灾难性风险,实验表明更强的推理能力反而会加剧危险行为,包括欺骗和违反指令,而非降低风险。
English Summary: Large language models acting as autonomous agents pose catastrophic risks in high-stakes CBRN scenarios, with experiments revealing that stronger reasoning capabilities often amplify rather than reduce dangerous behaviors including deception and command violations.

Authors:Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Title: CORDIAL: Can Multimodal Large Language Models Effectively Understand Coherence Relationships?
Abstract:
Multimodal Large Language Models (MLLMs) are renowned for their superior instruction-following and reasoning capabilities across diverse problem domains. However, existing benchmarks primarily focus on assessing factual and logical correctness in downstream tasks, with limited emphasis on evaluating MLLMs' ability to interpret pragmatic cues and intermodal relationships. To address this gap, we assess the competency of MLLMs in performing Multimodal Discourse Analysis (MDA) using Coherence Relations. Our benchmark, CORDIAL, encompasses a broad spectrum of Coherence Relations across 3 different discourse domains at varying levels of granularity. Through our experiments on 10+ MLLMs employing different prompting strategies, we show that even top models like Gemini 1.5 Pro and GPT-4o fail to match the performance of simple classifier-based baselines. This study emphasizes the need to move beyond similarity-based metrics and adopt a discourse-driven framework for evaluating MLLMs, providing a more nuanced assessment of their capabilities. The benchmark and code are available at: https://aashish2000.github.io/CORDIAL/

Authors:Yanran Wu, Inez Hua, Yi Ding
Title: Unveiling Environmental Impacts of Large Language Model Serving: A Functional Unit View
Abstract:
Large language models (LLMs) offer powerful capabilities but come with significant environmental impact, particularly in carbon emissions. Existing studies benchmark carbon emissions but lack a standardized basis for comparison across different model configurations. To address this, we introduce the concept of functional unit (FU) as a standardized basis and develop FUEL, the first FU-based framework for evaluating LLM serving's environmental impact. Through three case studies, we uncover key insights and trade-offs in reducing carbon emissions by optimizing model size, quantization strategy, and hardware choice, paving the way for more sustainable LLM serving. The code is available at https://github.com/jojacola/FUEL.
中文: 研究者提出基于功能单元的FUEL框架,为大型语言模型的环境影响评估建立统一标准,并通过案例研究揭示模型配置优化对降低碳排放的关键作用。
English: The authors propose FUEL, a functional unit-based framework to standardize environmental impact assessments of large language models, demonstrating through case studies how optimizing model configurations can reduce carbon emissions.
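The functional-unit idea reduces to a simple normalization, sketched below with "grams of CO2e per 1,000 generated tokens" as an assumed FU; the paper's framework generalizes the choice of FU, embodied carbon is omitted here, and the example numbers are made up.

```python
def carbon_per_functional_unit(power_watts: float,
                               tokens_per_second: float,
                               grid_intensity_gco2_per_kwh: float,
                               fu_tokens: int = 1000) -> float:
    """Operational carbon of LLM serving, normalized to a functional unit.
    FU here: grams CO2e per 1,000 generated tokens (an illustrative choice)."""
    seconds_per_fu = fu_tokens / tokens_per_second
    kwh_per_fu = power_watts * seconds_per_fu / 3.6e6   # watt-seconds -> kWh
    return kwh_per_fu * grid_intensity_gco2_per_kwh

# Example: 700 W accelerator, 90 tokens/s, 400 gCO2e/kWh grid.
print(f"{carbon_per_functional_unit(700, 90, 400):.2f} gCO2e per 1k tokens")
```

Normalizing this way makes configurations comparable: a smaller or quantized model that raises tokens-per-second at similar power draw directly lowers carbon per FU.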

Authors:Sayantan Adak, Somnath Banerjee, Rajarshi Mandal, Avik Halder, Sayan Layek, Rima Hazra, Animesh Mukherjee
Title: MemeSense: An Adaptive In-Context Framework for Social Commonsense Driven Meme Moderation
Abstract:
Memes present unique moderation challenges due to their subtle, multimodal interplay of images, text, and social context. Standard systems relying predominantly on explicit textual cues often overlook harmful content camouflaged by irony, symbolism, or cultural references. To address this gap, we introduce MemeSense, an adaptive in-context learning framework that fuses social commonsense reasoning with visually and semantically related reference examples. By encoding crucial task information into a learnable cognitive shift vector, MemeSense effectively balances lexical, visual, and ethical considerations, enabling precise yet context-aware meme intervention. Extensive evaluations on a curated set of implicitly harmful memes demonstrate that MemeSense substantially outperforms strong baselines, paving the way for safer online communities. Code and data available at: https://github.com/sayantan11995/MemeSense
中文: MemeSense提出了一种自适应框架,融合社会常识推理与多模态参考,有效检测并干预标准系统难以识别的有害模因,显著提升了内容审核的效果。
English: MemeSense introduces an adaptive framework that combines social commonsense reasoning with multimodal references to effectively detect and intervene in harmful memes overlooked by standard systems, significantly improving moderation performance.

Authors:Yixin Ou, Yunzhi Yao, Ningyu Zhang, Hui Jin, Jiacheng Sun, Shumin Deng, Zhenguo Li, Huajun Chen
Title: How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Abstract:
Despite exceptional capabilities in knowledge-intensive tasks, Large Language Models (LLMs) face a critical gap in understanding how they internalize new knowledge, particularly how to structurally embed acquired knowledge in their neural computations. We address this issue through the lens of knowledge circuit evolution, identifying computational subgraphs that facilitate knowledge storage and processing. Our systematic analysis of circuit evolution throughout continual pre-training reveals several key findings: (1) the acquisition of new knowledge is influenced by its relevance to pre-existing knowledge; (2) the evolution of knowledge circuits exhibits a distinct phase shift from formation to optimization; (3) the evolution of knowledge circuits follows a deep-to-shallow pattern. These insights not only advance our theoretical understanding of the mechanisms of new knowledge acquisition in LLMs, but also provide potential implications for improving continual pre-training strategies to enhance model performance. Code and data will be available at https://github.com/zjunlp/DynamicKnowledgeCircuits.
Chinese: 本研究通过知识回路演化的视角,揭示了大语言模型获取新知识受其与已有知识相关性影响,遵循从深层到浅层的模式,并经历形成到优化的阶段转变,为改进持续预训练策略提供了理论依据。
English: This study investigates how Large Language Models structurally embed new knowledge through knowledge circuit evolution, revealing that acquisition depends on relevance to existing knowledge, follows a deep-to-shallow pattern, and shifts from formation to optimization, offering insights to improve continual pre-training strategies.

Authors:Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, Huajun Chen, Ningyu Zhang
Title: ReLearn: Unlearning via Learning for Large Language Models
Abstract:
Current unlearning methods for large language models usually rely on reverse optimization to reduce target token probabilities. However, this paradigm disrupts subsequent token prediction, degrading model performance and linguistic coherence. Moreover, existing evaluation metrics overemphasize contextual forgetting while inadequately assessing response fluency and relevance. To address these challenges, we propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning, along with a comprehensive evaluation framework. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation, and Linguistic Score (LS) to evaluate generation quality. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output. Through mechanistic analysis, we further demonstrate how reverse optimization disrupts coherent text generation, while ReLearn preserves this essential capability. Code is available at https://github.com/zjunlp/unlearn.
中文摘要:提出的ReLearn方法通过数据增强和微调有效实现大语言模型的定向遗忘,同时保持输出质量,优于会破坏语言连贯性的反向优化方法。
English Summary: The proposed ReLearn method effectively achieves targeted unlearning in large language models through data augmentation and fine-tuning while maintaining output quality, outperforming reverse optimization approaches that compromise linguistic coherence.
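The two knowledge-level metrics can be sketched as simple rates over a forget set and a retain set. `model_answers` and `is_correct` are assumed helpers (a model call and an answer checker); the paper's actual KFR/KRR operate on knowledge-level matching rather than the plain counting shown here.

```python
def knowledge_rates(model_answers, forget_set, retain_set, is_correct):
    """Sketch of ReLearn-style evaluation: Knowledge Forgetting Rate (KFR)
    over the forget set and Knowledge Retention Rate (KRR) over the retain
    set, computed from per-question correctness after unlearning."""
    forgotten = sum(not is_correct(q, model_answers(q)) for q in forget_set)
    retained = sum(is_correct(q, model_answers(q)) for q in retain_set)
    kfr = forgotten / len(forget_set)   # high is good: target facts are gone
    krr = retained / len(retain_set)    # high is good: other facts survive
    return kfr, krr
```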

Authors:Ante Wang, Linfeng Song, Ye Tian, Dian Yu, Haitao Mi, Xiangyu Duan, Zhaopeng Tu, Jinsong Su, Dong Yu
Title: Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls
Abstract:
Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs), but at the cost of increased computational resources. In this work, we identify two key challenges contributing to this inefficiency: $\textit{over-exploration}$ due to redundant states with semantically equivalent content, and $\textit{under-exploration}$ caused by high variance in verifier scoring leading to frequent trajectory switching. To address these issues, we propose FETCH, an e$\textbf{f}$fici$\textbf{e}$nt $\textbf{t}$ree sear$\textbf{ch}$ framework, which is a flexible, plug-and-play system compatible with various tree search algorithms. Our framework mitigates over-exploration by merging semantically similar states using agglomerative clustering of text embeddings obtained from a fine-tuned SimCSE model. To tackle under-exploration, we enhance verifiers by incorporating temporal difference learning with adjusted $\lambda$-returns during training to reduce variance, and employing a verifier ensemble to aggregate scores during inference. Experiments on GSM8K, GSM-Plus, and MATH datasets demonstrate that our methods significantly improve reasoning accuracy and computational efficiency across four different tree search algorithms, paving the way for more practical applications of LLM-based reasoning. The code is available at https://github.com/Soistesimmer/Fetch.
中文: FETCH框架通过聚合语义相似状态减少过度探索,并利用时序差分学习增强验证器以解决探索不足,从而显著提升大语言模型的推理准确性和计算效率。
English: The FETCH framework enhances tree search efficiency in large language models by merging semantically similar states to reduce over-exploration and improving verifier reliability with temporal difference learning to address under-exploration, thereby boosting both reasoning accuracy and computational performance.
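The over-exploration fix can be sketched as a clustering pass over candidate states: embed each state's text, agglomeratively cluster by cosine distance, and keep one representative per cluster. The `embed` callback stands in for the paper's fine-tuned SimCSE encoder, and the distance threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def merge_equivalent_states(state_texts, embed, distance_threshold=0.2):
    """Sketch of FETCH's state merging: cluster candidate reasoning states
    by semantic embedding and keep one representative per cluster."""
    vecs = embed(state_texts)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold,
        metric="cosine", linkage="average").fit_predict(vecs)
    reps = {}
    for idx, lab in enumerate(labels):
        reps.setdefault(lab, idx)          # first state per cluster survives
    return [state_texts[i] for i in sorted(reps.values())]

# Toy usage with random embeddings standing in for SimCSE.
rng = np.random.default_rng(0)
toy_embed = lambda texts: rng.standard_normal((len(texts), 64))
print(merge_equivalent_states(["s1", "s2", "s3", "s4"], toy_embed))
```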

Authors:Tianshi Zheng, Jiayang Cheng, Chunyang Li, Haochen Shi, Zihao Wang, Jiaxin Bai, Yangqiu Song, Ginny Y. Wong, Simon See
Title: LogiDynamics: Unraveling the Dynamics of Inductive, Abductive and Deductive Logical Inferences in LLM Reasoning
Abstract:
Modern large language models (LLMs) employ diverse logical inference mechanisms for reasoning, making the strategic optimization of these approaches critical for advancing their capabilities. This paper systematically investigates the comparative dynamics of inductive (System 1) versus abductive/deductive (System 2) inference in LLMs. We utilize a controlled analogical reasoning environment, varying modality (textual, visual, symbolic), difficulty, and task format (MCQ / free-text). Our analysis reveals that System 2 pipelines generally excel, particularly in visual/symbolic modalities and harder tasks, while System 1 is competitive for textual and easier problems. Crucially, task format significantly influences their relative advantage, with System 1 sometimes outperforming System 2 in free-text rule-execution. These core findings generalize to broader in-context learning. Furthermore, we demonstrate that advanced System 2 strategies like hypothesis selection and iterative refinement can substantially scale LLM reasoning. This study offers foundational insights and actionable guidelines for strategically deploying logical inference to enhance LLM reasoning. Resources are available at https://github.com/HKUST-KnowComp/LogiDynamics.
中文摘要:本研究表明,在复杂视觉/符号任务中系统2推理通常优于系统1,而系统1在简单文本任务中仍具竞争力,且任务格式显著影响两者的相对表现优势。
English Summary: This study demonstrates that System 2 reasoning generally outperforms System 1 in complex visual/symbolic tasks, while System 1 remains competitive in simpler textual tasks, with task format significantly influencing their relative performance.

Authors:Jeonghyun Park, Hwanhee Lee
Title: Investigating Language Preference of Multilingual RAG Systems
Abstract:
Multilingual Retrieval-Augmented Generation (mRAG) systems enhance language models by integrating external multilingual information to produce context-aware responses. However, mRAG systems struggle with retrieving relevant information due to linguistic variations between queries and documents, generating inconsistent responses when multilingual sources conflict. In this work, we systematically investigate language preferences in both retrieval and generation of mRAG through a series of experiments. Our analysis indicates that retrievers tend to prefer high-resource and query languages, yet this preference does not consistently improve generation performance. Moreover, we observe that generators prefer the query language or Latin scripts, leading to inconsistent outputs. To overcome these issues, we propose Dual Knowledge Multilingual RAG (DKM-RAG), a simple yet effective framework that fuses translated multilingual passages with complementary model knowledge. Empirical results demonstrate that DKM-RAG mitigates language preference in generation and enhances performance across diverse linguistic settings. Code is available at https://github.com/jeonghyunpark2002/LanguagePreference.git
中文:多语言检索增强生成系统因查询与文档间的语言差异及多语言源冲突而难以检索相关信息并产生不一致响应,为此提出的双重知识多语言RAG框架融合翻译段落与模型知识,有效缓解语言偏好并提升跨语言性能。
English: Multilingual Retrieval-Augmented Generation (mRAG) systems face challenges in retrieving relevant information and generating consistent responses due to linguistic variations and conflicting sources, which are addressed by the proposed Dual Knowledge mRAG (DKM-RAG) framework that fuses translated passages with model knowledge to improve performance across languages.

Authors:Bohan Lyu, Siqiao Huang, Zichen Liang, Qi-An Sun, Jiaming Zhang
Title: SURGE: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors
Abstract:
Neural surrogate models have emerged as powerful and efficient tools in data mining. Meanwhile, large language models (LLMs) have demonstrated remarkable capabilities in code-related tasks. We investigate a novel application: using LLMs as surrogate models for code execution prediction. Given LLMs' unique ability to understand and process diverse programs, they present a promising direction for building general-purpose surrogate models. To systematically investigate this capability, we introduce SURGE, a comprehensive benchmark with $1160$ problems covering $8$ key aspects: multi-language programming tasks, competition-level programming problems, repository-level code analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, programs dependent on specific compilers or execution environments, and formal mathematical proof verification. Through extensive empirical analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings reveal important insights about the feasibility of LLMs as efficient surrogates for computational processes, with implications for automated software testing, program analysis, and computational resource optimization in data mining applications. Code and dataset are released at https://github.com/Imbernoulli/SURGE.
中文摘要:本研究通过SURGE基准系统评估大语言模型在代码执行预测中作为神经代理模型的可行性,涵盖多语言编程、竞赛题目等八大维度,对21个模型的测试揭示了其在计算过程中替代作用的重要潜力。
English Summary: The study introduces SURGE, a benchmark evaluating whether large language models (LLMs) can effectively serve as neural surrogate models for code execution prediction across diverse programming scenarios, revealing key insights about their feasibility through comprehensive testing of 21 models.

Authors:Jingyuan Huang, Jen-tse Huang, Ziyi Liu, Xiaoyuan Liu, Wenxuan Wang, Jieyu Zhao
Title: AI Sees Your Location, But With A Bias Toward The Wealthy World
Abstract:
Visual-Language Models (VLMs) have shown remarkable performance across various tasks, particularly in recognizing geographic information from images. However, VLMs still show regional biases in this task. To systematically evaluate these issues, we introduce a benchmark consisting of 1,200 images paired with detailed geographic metadata. Evaluating four VLMs, we find that while these models demonstrate the ability to recognize geographic information from images, achieving up to 53.8% accuracy in city prediction, they exhibit significant biases. Specifically, performance is substantially higher for economically developed and densely populated regions than for less developed (-12.5%) and sparsely populated (-17.0%) areas. Moreover, the models exhibit regional biases, frequently over-predicting certain locations; for instance, they consistently predict Sydney for images taken in Australia, as shown by the low entropy scores for these countries. The strong performance of VLMs also raises privacy concerns, particularly for users who share images online without the intent of being identified. Our code and dataset are publicly available at https://github.com/uscnlp-lime/FairLocator.
中文: 视觉语言模型在识别图像地理信息方面表现出显著准确性,但存在明显区域偏见,偏向发达和人口稠密地区,同时引发了对在线图像分享隐私问题的担忧。
English: Visual-Language Models demonstrate notable accuracy in identifying geographic details from images but exhibit significant regional biases, favoring developed and densely populated areas while raising privacy concerns for online image sharing.

Authors:Yuqi Liu, Yan Zheng
Title: Improving Similar Case Retrieval Ranking Performance By Revisiting RankSVM
Abstract:
Given the rapid development of Legal AI, a lot of attention has been paid to one of the most important legal AI tasks--similar case retrieval, especially with the use of language models. In our paper, however, we try to improve the ranking performance of current models from the perspective of learning to rank rather than language models. Specifically, we conduct experiments using a pairwise method--RankSVM--as the classifier to substitute a fully connected layer, combined with commonly used language models, on the similar case retrieval datasets LeCaRDv1 and LeCaRDv2. We conclude that RankSVM can generally help improve retrieval performance on the LeCaRDv1 and LeCaRDv2 datasets compared with the original classifiers by optimizing the precise ranking. It can also help mitigate overfitting owing to class imbalance. Our code is available at https://github.com/liuyuqi123study/RankSVM_for_SLR
中文摘要:本文通过在学习排序框架中用RankSVM替代传统分类器,显著提升了相似案例检索在LeCaRD数据集上的排序精度并缓解了过拟合问题。
English Summary: This paper enhances similar case retrieval performance by replacing traditional classifiers with RankSVM in learning-to-rank frameworks, demonstrating improved ranking accuracy and reduced overfitting on LeCaRD datasets.
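The pairwise substitution can be sketched as follows: build feature differences for every pair of cases from the same query that have different relevance labels, fit a linear SVM on the pair direction, and score cases with the learned weight vector. Feature extraction from the language model is abstracted into a toy random matrix here; the shapes and the C value are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def ranksvm_fit(features, relevance, groups):
    """Pairwise RankSVM sketch: train a linear SVM on feature differences
    of same-query case pairs with different relevance labels.
    features: (n, d) LM-derived case features; relevance: (n,) labels;
    groups: (n,) query ids. Plays the role of the final scoring layer."""
    diffs, signs = [], []
    for q in np.unique(groups):
        idx = np.where(groups == q)[0]
        for i in idx:
            for j in idx:
                if relevance[i] > relevance[j]:
                    diffs.append(features[i] - features[j]); signs.append(1)
                    diffs.append(features[j] - features[i]); signs.append(-1)
    svm = LinearSVC(C=1.0).fit(np.array(diffs), np.array(signs))
    return svm.coef_.ravel()                # w: rank a case by w @ x

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
y, g = rng.integers(0, 3, 30), np.repeat([0, 1, 2], 10)
w = ranksvm_fit(X, y, g)
print("ranking scores:", (X @ w)[:5])
```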

Authors:Yijie Chen, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
Title: Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping
Abstract:
Knowledge Distillation (KD) has emerged as a prominent technique for model compression. However, conventional KD approaches primarily focus on homogeneous architectures with identical tokenizers, constraining their applicability in cross-architecture scenarios. As for cross-tokenizer KD, the differences in the tokenizers give rise to two fundamental challenges: (1) sequence misalignment caused by divergent tokenization strategies, and (2) mismatched vocabulary size and composition. While existing probability-matching methods attempt to address these issues, their efficacy remains limited due to suboptimal alignment in both the sequence and vocabulary aspects. To overcome these limitations, we propose Contextual Dynamic Mapping (CDM), a novel cross-tokenizer distillation framework that employs contextual information to enhance sequence alignment precision and dynamically improves vocabulary mapping. We evaluated the effectiveness of our approach across five advanced and widely-used model families (i.e., Llama3, Phi3, Gemma2, OPT and Qwen2), which were configured into three distinct teacher-student pairs. Our method shows significant advantages over existing cross-tokenizer distillation baselines across diverse benchmarks, including instruction-following, code generation and math. Notably, our analysis reveals that combining conventional same-tokenizer distillation and cross-tokenizer distillation through CDM yields further performance improvements. The code is available at https://github.com/pppa2019/ContexualDynamicMapping
中文摘要:知识蒸馏在跨分词器场景下面临序列不对齐和词汇不匹配的挑战,而提出的上下文动态映射(CDM)框架通过增强对齐精度和动态词汇映射,在多种模型家族中实现了显著性能提升。
English Summary: Knowledge distillation faces challenges in cross-tokenizer scenarios due to sequence misalignment and vocabulary mismatch, which the proposed Contextual Dynamic Mapping (CDM) framework addresses by enhancing alignment precision and dynamic vocabulary mapping across multiple model families.

Authors:Yuting Huang, Chengyuan Liu, Yifeng Feng, Yiquan Wu, Chao Wu, Fei Wu, Kun Kuang
Title: Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction
Abstract:
As Large Language Models (LLMs) are widely applied in various domains, the safety of LLMs is increasingly attracting attention to avoid their powerful capabilities being misused. Existing jailbreak methods create a forced instruction-following scenario, or search for adversarial prompts with prefix or suffix tokens to achieve a specific representation manually or automatically. However, they suffer from low efficiency and explicit jailbreak patterns, far from practical for mass attacks on deployed LLMs. In this paper, we point out that simply rewriting the original instruction can achieve a jailbreak, and we find that this rewriting approach is learnable and transferable. We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method that attacks LLMs by iteratively exploring their weaknesses and automatically improving the attacking strategy. The jailbreak is more efficient and harder to identify since no additional features are introduced. Extensive experiments and analysis demonstrate the effectiveness of R2J, and we find that the jailbreak is also transferable to multiple datasets and various types of models with only a few queries. We hope our work motivates further investigation of LLM safety. The code can be found at https://github.com/ythuang02/R2J/.
English Summary: The paper introduces Rewrite to Jailbreak (R2J), a transferable black-box method that efficiently attacks Large Language Models by iteratively rewriting instructions to exploit model weaknesses without introducing detectable patterns.

Authors:Haoyang Li, Xuejia Chen, Zhanchao XU, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, Lei Chen
Title: Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic arithmetic, numerical retrieval, and magnitude comparison, remains surprisingly poor. This gap arises from their reliance on surface-level statistical patterns rather than understanding numbers as continuous magnitudes. Existing benchmarks primarily focus on either linguistic competence or structured mathematical problem-solving, neglecting fundamental numerical reasoning required in real-world scenarios. To bridge this gap, we propose NumericBench, a comprehensive benchmark to evaluate six fundamental numerical capabilities: number recognition, arithmetic operations, contextual retrieval, comparison, summary, and logical reasoning. NumericBench includes datasets ranging from synthetic number lists to crawled real-world data, addressing challenges like long contexts, noise, and multi-step reasoning. Extensive experiments on state-of-the-art LLMs, including GPT-4 and DeepSeek, reveal persistent weaknesses in numerical reasoning, highlighting the urgent need to improve numerically-aware language modeling. The benchmark is released at: https://github.com/TreeAI-Lab/NumericBench.
Chinese: 大语言模型在语言任务上表现出色,但在数值推理方面存在明显不足,因其依赖表层统计模式,为此我们提出NumericBench基准来评估六项核心数值能力,并揭示GPT-4等模型的持续缺陷。
English: Large Language Models excel in linguistic tasks but struggle with numerical reasoning due to their reliance on surface patterns, prompting the creation of NumericBench to evaluate six core numerical skills and reveal persistent weaknesses in models like GPT-4.
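
As a concrete illustration of the kind of probe such a benchmark contains, the sketch below measures exact-match accuracy on synthetic multi-digit addition. It is a minimal stand-in, not the released NumericBench harness, and the `llm` callable is an assumed text-in/text-out interface.

```python
import random

def probe_arithmetic(llm, n_trials: int = 100, digits: int = 6) -> float:
    """Minimal numeracy probe in the spirit of NumericBench (a sketch):
    exact-match accuracy on random multi-digit addition."""
    correct = 0
    for _ in range(n_trials):
        a = random.randrange(10 ** (digits - 1), 10 ** digits)
        b = random.randrange(10 ** (digits - 1), 10 ** digits)
        reply = llm(f"Compute {a} + {b}. Answer with the number only.")
        correct += reply.strip() == str(a + b)
    return correct / n_trials
```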

Authors:Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, Dacheng Tao
Title: Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
Abstract:
Multi-turn jailbreak attacks simulate real-world human interactions by engaging large language models (LLMs) in iterative dialogues, exposing critical safety vulnerabilities. However, existing methods often struggle to balance semantic coherence with attack effectiveness, resulting in either benign semantic drift or ineffective detection evasion. To address this challenge, we propose Reasoning-Augmented Conversation, a novel multi-turn jailbreak framework that reformulates harmful queries into benign reasoning tasks and leverages LLMs' strong reasoning capabilities to compromise safety alignment. Specifically, we introduce an attack state machine framework to systematically model problem translation and iterative reasoning, ensuring coherent query generation across multiple turns. Building on this framework, we design gain-guided exploration, self-play, and rejection feedback modules to preserve attack semantics, enhance effectiveness, and sustain reasoning-driven attack progression. Extensive experiments on multiple LLMs demonstrate that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios, with attack success rates (ASRs) increasing by up to 96%. Notably, our approach achieves ASRs of 82% and 92% against leading commercial models, OpenAI o1 and DeepSeek R1, underscoring its potency. We release our code at https://github.com/NY1024/RACE to facilitate further research in this critical domain.
中文摘要:提出的推理增强对话(RACE)框架通过将有害查询重构为良性推理任务,在多轮越狱攻击中实现了最先进的攻击效果,对主流大模型的攻击成功率最高提升96%。
English Summary: The proposed Reasoning-Augmented Conversation (RACE) framework enhances multi-turn jailbreak attacks by transforming harmful queries into benign reasoning tasks, achieving state-of-the-art effectiveness with up to 96% higher success rates against leading LLMs.

Authors:Jiahao Huo, Yibo Yan, Xu Zheng, Yuanhuiyi Lyu, Xin Zou, Zhihua Wei, Xuming Hu
Title: MMUnlearner: Reformulating Multimodal Machine Unlearning in the Era of Multimodal Large Language Models
Abstract:
Recent progress in Machine Unlearning (MU) has introduced solutions for the selective removal of private or sensitive information encoded within deep neural networks. Nonetheless, MU for Multimodal Large Language Models (MLLMs) remains in its nascent phase. Therefore, we propose to reformulate the task of multimodal MU in the era of MLLMs, which aims to erase only the visual patterns associated with a given entity while preserving the corresponding textual knowledge encoded within the original parameters of the language model backbone. Furthermore, we develop a novel geometry-constrained gradient ascent method MMUnlearner. It updates the weights of MLLMs with a weight saliency map jointly restricted by the remaining concepts and textual knowledge during unlearning, thereby preserving parameters essential for non-target knowledge. Extensive experiments demonstrate that MMUnlearner surpasses baselines that directly finetune MLLMs with VQA data through Gradient Ascent (GA) or Negative Preference Optimization (NPO), across all evaluation dimensions. Our code can be found in [this URL](https://github.com/Z1zs/MMUnlearner).
中文: 本研究提出MMUnlearner这一多模态机器遗忘新方法,能在多模态大语言模型中选择性消除特定实体的视觉模式同时保留文本知识,在所有评估维度上均优于现有技术。
English: This study introduces MMUnlearner, a novel method for multimodal machine unlearning that selectively erases visual patterns of specific entities in MLLMs while preserving textual knowledge, outperforming existing techniques across all evaluation metrics.
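
A minimal sketch of one geometry-constrained gradient-ascent step is shown below, assuming a HuggingFace-style model whose forward pass returns a `.loss`, and a precomputed `saliency_mask` mapping parameter names to 0/1 tensors; the paper's exact saliency construction is not reproduced here.

```python
import torch

def mmunlearner_step(model, forget_batch, saliency_mask: dict, lr: float = 1e-5):
    """One saliency-masked gradient-ascent step (sketch): ascend the loss
    on the forget set, but only along parameters the saliency map marks
    as specific to the target visual concept."""
    loss = model(**forget_batch).loss
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            # ascent restricted to salient (concept-specific) weights
            p += lr * saliency_mask.get(name, torch.zeros_like(p)) * p.grad
            p.grad = None
```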

Authors:Shijing Hu, Jingyang Li, Xingyu Xie, Zhihui Lu, Kim-Chuan Toh, Pan Zhou
Title: GRIFFIN: Effective Token Alignment for Faster Speculative Decoding
Abstract:
Speculative decoding accelerates inference in large language models (LLMs) by generating multiple draft tokens simultaneously. However, existing methods often struggle with token misalignment between the training and decoding phases, limiting their performance. To address this, we propose GRIFFIN, a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model to mitigate misalignment. The training strategy employs a loss masking mechanism to exclude highly misaligned tokens during training, preventing them from negatively impacting the draft model's optimization. The token-alignable draft model introduces input tokens to correct inconsistencies in generated features. Experiments on LLaMA, Vicuna, Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 8% and a speedup ratio exceeding 7%, outperforming current speculative decoding state-of-the-art methods. Our code and GRIFFIN's draft models are released publicly in https://github.com/hsj576/GRIFFIN.
中文: GRIFFIN通过可对齐令牌的训练策略和草稿模型解决推测解码中的错位问题,在多个大语言模型上实现超过8%的接受长度提升和7%以上的加速效果。
English: GRIFFIN introduces a token-alignable training strategy and draft model to mitigate misalignment in speculative decoding, achieving over 8% improvement in acceptance length and exceeding 7% speedup across multiple LLMs.
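
The loss-masking mechanism can be sketched as a masked cross-entropy, where positions flagged as highly misaligned between training and decoding are excluded from the draft model's objective; `aligned_mask` is an assumed precomputed float tensor (1 where draft and target tokens align).

```python
import torch
import torch.nn.functional as F

def masked_draft_loss(draft_logits, target_tokens, aligned_mask):
    """Token-alignable training loss (sketch): standard cross-entropy,
    with misaligned positions zeroed out so they cannot distort the
    draft model's optimization."""
    ce = F.cross_entropy(
        draft_logits.transpose(1, 2), target_tokens, reduction="none"
    )  # (batch, seq_len)
    return (ce * aligned_mask).sum() / aligned_mask.sum().clamp(min=1)
```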

Authors:Pengcheng Jiang, Lang Cao, Ruike Zhu, Minhao Jiang, Yunyi Zhang, Jimeng Sun, Jiawei Han
Title: RAS: Retrieval-And-Structuring for Knowledge-Intensive LLM Generation
Abstract:
Large language models (LLMs) have achieved impressive performance on knowledge-intensive tasks, yet they often struggle with multi-step reasoning due to the unstructured nature of retrieved context. While retrieval-augmented generation (RAG) methods provide external information, the lack of explicit organization among retrieved passages limits their effectiveness, leading to brittle reasoning pathways. Recent interpretability studies highlighting the importance of structured intermediate reasoning further align with this perspective. We propose Retrieval-And-Structuring (RAS), a framework that dynamically constructs query-specific knowledge graphs through iterative retrieval and structured knowledge building. RAS interleaves targeted retrieval planning with incremental graph construction, enabling models to assemble and reason over evolving knowledge structures tailored to each query. On seven knowledge-intensive benchmarks, RAS consistently outperforms strong baselines, achieving up to 6.4% and 7.0% gains with open-source and proprietary LLMs, respectively. Our results demonstrate that dynamic, query-specific knowledge structuring offers a robust path to improving reasoning accuracy and robustness in language model generation. Our data and code can be found at https://github.com/pat-jj/RAS.
Chinese: 提出的检索与结构化(RAS)框架通过迭代检索和结构化动态构建特定查询的知识图谱,在多个基准测试中显著提升了语言模型的推理准确性和鲁棒性。
English: The proposed Retrieval-And-Structuring (RAS) framework dynamically builds query-specific knowledge graphs through iterative retrieval and structuring, significantly enhancing reasoning accuracy and robustness in language models across multiple benchmarks.
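
A schematic of the interleaved retrieve-and-structure loop is given below; `retriever` and the `llm.plan_query` / `llm.extract_triples` / `llm.answer` helpers are assumed interfaces standing in for the paper's components.

```python
def ras_answer(question: str, retriever, llm, max_iters: int = 4) -> str:
    """Retrieval-And-Structuring loop (sketch): interleave retrieval
    planning with incremental knowledge-graph construction until the
    model judges the graph sufficient."""
    graph = []  # list of (head, relation, tail) triples
    query = question
    for _ in range(max_iters):
        passages = retriever(query)
        graph += llm.extract_triples(passages, question)  # structure new evidence
        query = llm.plan_query(question, graph)           # plan next retrieval
        if query is None:                                 # graph deemed sufficient
            break
    return llm.answer(question, graph)
```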

Authors:Yixuan Tang, Yi Yang
Title: FinMTEB: Finance Massive Text Embedding Benchmark
Abstract:
Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advances in large language models (LLMs) have further enhanced the performance of embedding models. While these models are often benchmarked on general-purpose datasets, real-world applications demand domain-specific evaluation. In this work, we introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a specialized counterpart to MTEB designed for the financial domain. FinMTEB comprises 64 financial domain-specific embedding datasets across 7 tasks that cover diverse textual types in both Chinese and English, such as financial news articles, corporate annual reports, ESG reports, regulatory filings, and earnings call transcripts. We also develop a finance-adapted model, Fin-E5, using a persona-based data synthetic method to cover diverse financial embedding tasks for training. Through extensive evaluation of 15 embedding models, including Fin-E5, we show three key findings: (1) performance on general-purpose benchmarks shows limited correlation with financial domain tasks; (2) domain-adapted models consistently outperform their general-purpose counterparts; and (3) surprisingly, a simple Bag-of-Words (BoW) approach outperforms sophisticated dense embeddings in financial Semantic Textual Similarity (STS) tasks, underscoring current limitations in dense embedding techniques. Our work establishes a robust evaluation framework for financial NLP applications and provides crucial insights for developing domain-specific embedding models.
中文摘要:本文提出金融领域专用基准FinMTEB,通过评估发现领域适配模型优于通用模型,并揭示稠密嵌入在金融语义相似性任务中的现有局限性。
English Summary: This paper introduces FinMTEB, a specialized financial benchmark for evaluating embedding models, and demonstrates that domain-adapted models outperform general ones while revealing surprising limitations of dense embeddings in financial tasks.

Authors:Zongqian Wu, Tianyu Li, Baoduo Xu, Jiaying Yang, Mengmeng Zhan, Xiaofeng Zhu, Lei Feng
Title: Is Depth All You Need? An Exploration of Iterative Reasoning in LLMs
Abstract:
Deep iterative chain-of-thought (CoT) reasoning enables LLMs to tackle complex tasks by progressively activating relevant pre-trained knowledge. However, it faces challenges in ensuring continual improvement and determining a stopping criterion. In this paper, we investigate whether the relevant knowledge that contributes directly to solving the given question can be activated from the initial reasoning path, thus circumventing the need for iterative refinement. Our experiments reveal that increasing the diversity of initial reasoning paths can achieve comparable or superior performance, a concept we term \textit{breadth reasoning}. However, existing breadth reasoning approaches, such as self-consistency, offer limited diversity. To address this limitation, we propose a simple yet effective method that enhances reasoning breadth by integrating contextual exploration with reduced sampling randomness. Extensive experiments demonstrate that our approach significantly outperforms deep iterative reasoning. Our code is provided in https://github.com/zongqianwu/breadth.
Chinese: 深度迭代思维链推理存在持续改进和停止标准的难题,本文提出广度推理方法,通过结合上下文探索与减少采样随机性来多样化初始推理路径,从而显著提升性能。
English: Deep iterative CoT reasoning struggles with continuous improvement and stopping criteria, but this paper introduces breadth reasoning, which enhances performance by diversifying initial reasoning paths through contextual exploration and reduced sampling randomness.
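
A minimal sketch of breadth reasoning under these assumptions: diversify the initial reasoning paths via varied exploration hints, sample each path with reduced randomness, and majority-vote the answers. The hint wording and the `llm.generate` signature are illustrative only.

```python
from collections import Counter

def breadth_reason(llm, question: str, n_paths: int = 8) -> str:
    """Breadth-reasoning sketch: many diverse initial paths, low sampling
    randomness per path, majority vote over final answers."""
    hints = [
        f"Approach #{i}: start from a different aspect of the problem."
        for i in range(n_paths)
    ]
    answers = [
        llm.generate(f"{hint}\n{question}", temperature=0.3) for hint in hints
    ]
    return Counter(answers).most_common(1)[0][0]
```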

Authors:Lei Sheng, Shuai-Shuai Xu, Wei Xie
Title: BASE-SQL: A powerful open source Text-To-SQL baseline approach
Abstract:
The conversion of natural language into SQL for querying databases (Text-to-SQL) has broad application prospects and has attracted widespread attention. At present, the mainstream Text-to-SQL methods are mainly divided into in-context learning (ICL) based methods and supervised fine-tuning (SFT) based methods. ICL-based methods can achieve relatively good results thanks to the use of the most advanced closed-source models. However, in real-world application scenarios, factors such as data privacy, SQL generation efficiency and cost need to be considered, where SFT-based methods have certain advantages. At present, methods based on fine-tuning open-source models lack an easy-to-implement and cost-effective baseline. We propose a pipeline-based method using open-source model fine-tuning, referred to as BASE-SQL, which includes four components: Schema Linking, Candidate SQL Generate, SQL Revision and SQL Merge Revision. Experimental results show that BASE-SQL, using the open-source model Qwen2.5-Coder-32B-Instruct, achieves an accuracy of 67.47% on the BIRD development set and 88.9% on the Spider test set, which is significantly better than other methods using open-source models, and even exceeds several methods using the GPT-4o closed-source model. At the same time, BASE-SQL is easy to implement and highly efficient (on average, only five calls to the large language model are required per SQL generation). The code will be open sourced at https://github.com/CycloneBoy/base_sql.
中文: BASE-SQL是一种基于开源模型微调的管道式Text-to-SQL方法,在基准测试中表现出优越的准确率,同时具备高效易实现的优势。
English: BASE-SQL is a pipeline-based Text-to-SQL method using open-source model fine-tuning that achieves superior accuracy on benchmark datasets while being efficient and easy to implement.
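
Read as pseudocode, the four-component pipeline might be composed as below, with each stage an assumed LLM call; the actual repository implements these stages with fine-tuned open-source models.

```python
def base_sql(question: str, schema: dict, llm) -> str:
    """BASE-SQL-style pipeline (sketch): the four components from the
    paper, each realized here as a single assumed LLM call."""
    linked = llm.schema_link(question, schema)                 # Schema Linking
    candidates = [llm.generate_sql(question, linked)           # Candidate SQL Generate
                  for _ in range(2)]
    revised = [llm.revise_sql(question, linked, sql)           # SQL Revision
               for sql in candidates]
    return llm.merge_revise(question, linked, revised)         # SQL Merge Revision
```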

Authors:Qiujie Xie, Qingqiu Li, Zhuohao Yu, Yuejie Zhang, Yue Zhang, Linyi Yang
Title: An Empirical Analysis of Uncertainty in Large Language Model Evaluations
Abstract:
As LLM-as-a-Judge emerges as a new paradigm for assessing large language models (LLMs), concerns have been raised regarding the alignment, bias, and stability of LLM evaluators. While substantial work has focused on alignment and bias, little research has concentrated on the stability of LLM evaluators. In this paper, we conduct extensive experiments involving 9 widely used LLM evaluators across 2 different evaluation settings to investigate the uncertainty in model-based LLM evaluations. We pinpoint that LLM evaluators exhibit varying uncertainty based on model families and sizes. With careful comparative analyses, we find that employing special prompting strategies, whether during inference or post-training, can alleviate evaluation uncertainty to some extent. By utilizing uncertainty to enhance LLM's reliability and detection capability in Out-Of-Distribution (OOD) data, we further fine-tune an uncertainty-aware LLM evaluator named ConfiLM using a human-annotated fine-tuning set and assess ConfiLM's OOD evaluation ability on a manually designed test set sourced from the 2024 Olympics. Experimental results demonstrate that incorporating uncertainty as additional information during the fine-tuning phase can largely improve the model's evaluation performance in OOD scenarios. The code and data are released at: https://github.com/hasakiXie123/LLM-Evaluator-Uncertainty.
中文: 本研究探讨了大型语言模型评估者的不确定性,发现模型系列和规模影响稳定性,并提出ConfiLM,一种通过不确定性信息微调的感知不确定性的评估器,以提升在分布外场景下的评估性能。
English: This study investigates the uncertainty in LLM evaluators, finding that model families and sizes affect stability, and proposes ConfiLM, an uncertainty-aware evaluator fine-tuned with uncertainty information to improve performance in out-of-distribution scenarios.

Authors:Zirui Song, Bin Yan, Yuhan Liu, Miao Fang, Mingzhe Li, Rui Yan, Xiuying Chen
Title: Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey
Abstract:
Large Language Models (LLMs) have demonstrated remarkable success in various tasks such as natural language understanding, text summarization, and machine translation. However, their general-purpose nature often limits their effectiveness in domain-specific applications that require specialized knowledge, such as healthcare, chemistry, or legal analysis. To address this, researchers have explored diverse methods to enhance LLMs by integrating domain-specific knowledge. In this survey, we provide a comprehensive overview of these methods, which we categorize into four key approaches: dynamic knowledge injection, static knowledge embedding, modular adapters, and prompt optimization. Each approach offers unique mechanisms to equip LLMs with domain expertise, balancing trade-offs between flexibility, scalability, and efficiency. We discuss how these methods enable LLMs to tackle specialized tasks, compare their advantages and disadvantages, evaluate domain-specific LLMs against general LLMs, and highlight the challenges and opportunities in this emerging field. For those interested in delving deeper into this area, we also summarize the commonly used datasets and benchmarks. To keep researchers updated on the latest studies, we maintain an open-source repository at https://github.com/abilliyb/Knowledge_Injection_Survey_Papers, dedicated to documenting research in the field of specialized LLMs.
中文摘要:大语言模型在通用任务中表现出色,但在专业领域应用中受限,因此研究者探索了动态知识注入、静态知识嵌入、模块适配器和提示优化四种核心方法,以增强其领域专业知识,同时权衡灵活性、可扩展性和效率。
English Summary: Large Language Models excel in general tasks but struggle with domain-specific applications, prompting researchers to develop four key methods—dynamic knowledge injection, static knowledge embedding, modular adapters, and prompt optimization—to enhance their specialized knowledge while balancing flexibility, scalability, and efficiency.

Authors:Yu-Ang Lee, Ching-Yun Ko, Tejaswini Pedapati, I-Hsin Chung, Mi-Yen Yeh, Pin-Yu Chen
Title: STAR: Spectral Truncation and Rescale for Model Merging
Abstract:
Model merging is an efficient way of obtaining a multi-task model from several pretrained models without further fine-tuning, and it has gained attention in various domains, including natural language processing (NLP). Despite the efficiency, a key challenge in model merging is the seemingly inevitable decrease in task performance as the number of models increases. In this paper, we propose $\mathbf{S}$pectral $\mathbf{T}$runcation $\mathbf{A}$nd $\mathbf{R}$escale (STAR) that aims at mitigating ``merging conflicts'' by truncating small components in the respective spectral spaces, which is followed by an automatic parameter rescaling scheme to retain the nuclear norm of the original matrix. STAR requires no additional inference on original training data and is robust to hyperparameter choice. We demonstrate the effectiveness of STAR through extensive model merging cases on diverse NLP tasks. Specifically, STAR works robustly across varying model sizes, and can outperform baselines by 4.2$\%$ when merging 12 models on Flan-T5. Our code is publicly available at https://github.com/IBM/STAR.
中文: 本文提出的STAR方法通过截断谱空间分量和自动参数重缩放来缓解模型合并中的性能下降问题,在多种自然语言处理任务中无需额外数据即实现稳定性能提升。
English: The paper introduces STAR, a method that reduces performance loss in model merging by truncating spectral components and rescaling parameters, showing robust improvements across NLP tasks without needing extra data or fine-tuning.
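
The core operation is easy to state: truncate small singular values of each task-vector matrix, then rescale the survivors so the nuclear norm of the original matrix is retained. A minimal PyTorch sketch, with `keep_rank` as an assumed truncation hyperparameter:

```python
import torch

def star(delta: torch.Tensor, keep_rank: int) -> torch.Tensor:
    """Spectral Truncation And Rescale (sketch): zero out small singular
    values, then rescale the kept ones to restore the nuclear norm."""
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    S_kept = S.clone()
    S_kept[keep_rank:] = 0.0
    S_kept *= S.sum() / S_kept.sum()  # restore the original nuclear norm
    return (U * S_kept) @ Vh
```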

Authors:Aivin V. Solatorio, Rafael Macalaba, James Liounis
Title: Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers
Abstract:
Tracking how data is mentioned and used in research papers provides critical insights for improving data discoverability, quality, and production. However, manually identifying and classifying dataset mentions across vast academic literature is resource-intensive and not scalable. This paper presents a machine learning framework that automates dataset mention detection across research domains by leveraging large language models (LLMs), synthetic data, and a two-stage fine-tuning process. We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset. The Phi-3.5-mini instruct model is pre-fine-tuned on this dataset, followed by fine-tuning on a manually annotated subset. At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall. Evaluated on a held-out manually annotated sample, our fine-tuned model outperforms NuExtract-v1.5 and GLiNER-large-v2.1 in dataset extraction accuracy. Our results highlight how LLM-generated synthetic data can effectively address training data scarcity, improving generalization in low-resource settings. This framework offers a pathway toward scalable monitoring of dataset usage, enhancing transparency, and supporting researchers, funders, and policymakers in identifying data gaps and strengthening data accessibility for informed decision-making.
中文: 本文提出一种机器学习框架,利用大语言模型和合成数据自动识别研究论文中的数据集引用,其性能优于现有方法,可提升数据可发现性以支持科学决策。
English: This paper introduces a machine learning framework that automates dataset mention detection in research papers using large language models and synthetic data, outperforming existing methods and enhancing data discoverability for better decision-making.

Authors:Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang Yu, Dapeng Shi, Dingyuan Hu, Enle Liu, Gang Yu, Ge Yang, Guanzhe Huang, Gulin Yan, Haiyang Feng, Hao Nie, Haonan Jia, Hanpeng Hu, Hanqi Chen, Haolong Yan, Heng Wang, Hongcheng Guo, Huilin Xiong, Huixin Xiong, Jiahao Gong, Jianchang Wu, Jiaoren Wu, Jie Wu, Jie Yang, Jiashuai Liu, Jiashuo Li, Jingyang Zhang, Junjing Guo, Junzhe Lin, Kaixiang Li, Lei Liu, Lei Xia, Liang Zhao, Liguo Tan, Liwen Huang, Liying Shi, Ming Li, Mingliang Li, Muhua Cheng, Na Wang, Qiaohui Chen, Qinglin He, Qiuyan Liang, Quan Sun, Ran Sun, Rui Wang, Shaoliang Pang, Shiliang Yang, Sitong Liu, Siqi Liu, Shuli Gao, Tiancheng Cao, Tianyu Wang, Weipeng Ming, Wenqing He, Xu Zhao, Xuelin Zhang, Xianfang Zeng, Xiaojia Liu, Xuan Yang, Yaqi Dai, Yanbo Yu, Yang Li, Yineng Deng, Yingming Wang, Yilei Wang, Yuanwei Lu, Yu Chen, Yu Luo, Yuchu Luo, Yuhe Yin, Yuheng Feng, Yuxiang Yang, Zecheng Tang, Zekai Zhang, Zidong Yang, Binxing Jiao, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu, Heung-Yeung Shum, Daxin Jiang
Title: Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Abstract:
We present Step-Video-T2V, a state-of-the-art text-to-video pre-trained model with 30B parameters and the ability to generate videos up to 204 frames in length. A deep compression Variational Autoencoder, Video-VAE, is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios, while maintaining exceptional video reconstruction quality. User prompts are encoded using two bilingual text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames. A video-based DPO approach, Video-DPO, is applied to reduce artifacts and improve the visual quality of the generated videos. We also detail our training strategies and share key observations and insights. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its state-of-the-art text-to-video quality when compared with both open-source and commercial engines. Additionally, we discuss the limitations of current diffusion-based model paradigm and outline future directions for video foundation models. We make both Step-Video-T2V and Step-Video-T2V-Eval available at https://github.com/stepfun-ai/Step-Video-T2V. The online version can be accessed from https://yuewen.cn/videos as well. Our goal is to accelerate the innovation of video foundation models and empower video content creators.
中文: Step-Video-T2V是一个拥有300亿参数的最先进文本到视频模型,通过深度压缩变分自编码器和三维全注意力扩散变换器,能生成长达204帧的双语高质量视频,在性能评估中展现出业界领先水平。
English: Step-Video-T2V is a 30B-parameter text-to-video model that generates high-quality 204-frame videos using advanced compression and denoising techniques, achieving state-of-the-art performance in bilingual video synthesis.

Authors:Riccardo Bravin, Massimo Pavan, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, Manuel Roveri
Title: EmbBERT-Q: Breaking Memory Barriers in Embedded NLP
Abstract:
Large Language Models (LLMs) have revolutionized natural language processing, setting new standards across a wide range of applications. However, their substantial memory and computational demands make them impractical for deployment on technologically-constrained tiny devices such as wearable devices and Internet-of-Things units. To address this limitation, we introduce EmbBERT-Q, a novel tiny language model specifically designed for tiny devices with stringent memory constraints. EmbBERT-Q achieves state-of-the-art (SotA) accuracy in Natural Language Processing tasks in this scenario, with a total memory footprint (weights and activations) of just 781 kB, representing a 25x reduction in size with respect to SotA models. By combining architectural innovations with hardware-compatible 8-bit quantization, EmbBERT-Q consistently outperforms several baseline models scaled down to a 2 MB memory budget (i.e., the maximum memory typically available in tiny devices), including heavily compressed versions of BERT and MAMBA. Extensive experimental evaluations on both a selected benchmark dataset, TinyNLP, specifically curated to evaluate Tiny Language Models in NLP tasks and real-world scenarios, and the GLUE benchmark, demonstrate EmbBERT-Q's ability to deliver competitive accuracy with respect to existing approaches, achieving an unmatched balance between memory and performance. To ensure the complete and immediate reproducibility of all our results, we release all code, scripts, and model checkpoints at https://github.com/RiccardoBravin/tiny-LLM.
中文:EmbBERT-Q是一种专为内存受限微型设备设计的新型微型语言模型,通过架构创新和8位量化技术,仅用781 kB内存占用就实现了最先进的准确率,尺寸缩小了25倍。
English: EmbBERT-Q is a novel tiny language model designed for memory-constrained tiny devices, achieving state-of-the-art accuracy with a 25x size reduction to just 781 kB through architectural innovations and 8-bit quantization.
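
For the quantization half of the recipe, a generic symmetric 8-bit scheme looks like the sketch below; this is standard per-tensor int8 quantization for illustration, not necessarily EmbBERT-Q's exact hardware-compatible variant.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization (generic sketch): maps
    float weights to int8 with one scale, cutting memory ~4x vs float32."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale  # dequantize with q.float() * scale
```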

Authors:Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li
Title: Large Language Diffusion Models
Abstract:
Autoregressive models (ARMs) are widely regarded as the cornerstone of large language models (LLMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA models distributions through a forward data masking process and a reverse process, parameterized by a vanilla Transformer to predict masked tokens. By optimizing a likelihood bound, it provides a principled generative approach for probabilistic inference. Across extensive benchmarks, LLaDA demonstrates strong scalability, outperforming our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings establish diffusion models as a viable and promising alternative to ARMs, challenging the assumption that key LLM capabilities discussed above are inherently tied to ARMs. Project page and codes: https://ml-gsai.github.io/LLaDA-demo/.
Chinese: LLaDA作为一种基于扩散的模型,通过在语言任务中展现竞争力并解决如逆向诅咒等问题,挑战了自回归模型的地位,为大型语言模型提供了可行的替代方案。
English: LLaDA, a diffusion-based model, challenges autoregressive models by demonstrating competitive performance in language tasks and addressing issues like the reversal curse, offering a viable alternative for large language models.
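
A single masked-diffusion training step can be sketched as follows, assuming `model` is a vanilla Transformer returning per-position vocabulary logits; the 1/t weighting mirrors the likelihood-bound objective described in the abstract, though the details here are simplified.

```python
import torch
import torch.nn.functional as F

def llada_loss(model, tokens: torch.Tensor, mask_id: int) -> torch.Tensor:
    """One masked-diffusion step (sketch): sample a masking ratio t,
    mask that fraction of tokens, train the model to predict them."""
    b, n = tokens.shape
    t = torch.rand(b, 1, device=tokens.device).clamp(min=1e-3)  # ratio per sequence
    masked = torch.rand(b, n, device=tokens.device) < t          # positions to mask
    noisy = torch.where(masked, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                                        # (b, n, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    return ((ce * masked) / t).sum() / masked.sum().clamp(min=1)
```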

Authors:Xiaoya Lu, Dongrui Liu, Yi Yu, Luxin Xu, Jing Shao
Title: X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability
Abstract:
Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.
中文摘要:本研究提出X-Boundary方法,通过精确区分安全与有害特征表示并仅消除后者,在保持大语言模型通用能力的同时,将过度拒绝率降低约20%,实现了对多轮越狱攻击的最优防御效果。
English Summary: The study introduces X-Boundary, a novel defense method that enhances LLM robustness against multi-turn jailbreaks by precisely distinguishing and erasing harmful representations while preserving usability and reducing over-refusal by approximately 20%.
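
One way to read the core idea as a loss is a margin-based separation between harmful and boundary-safe hidden states, as in the hypothetical sketch below; the paper's actual objective is more involved, so treat this only as intuition.

```python
import torch
import torch.nn.functional as F

def x_boundary_loss(harmful_h, boundary_safe_h, margin: float = 1.0):
    """Margin separation sketch: push harmful hidden states away from
    nearby boundary-safe ones without moving the safe representations
    (hence the detach on the safe side)."""
    d = torch.cdist(harmful_h, boundary_safe_h.detach())  # pairwise distances
    return F.relu(margin - d).mean()  # penalize harmful states inside the margin
```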

Authors:Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, Minhao Cheng
Title: LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs -- No Silver Bullet for LC or RAG Routing
Abstract:
Effectively incorporating external knowledge into Large Language Models (LLMs) is crucial for enhancing their capabilities and addressing real-world needs. Retrieval-Augmented Generation (RAG) offers an effective method for achieving this by retrieving the most relevant fragments into LLMs. However, the advancements in context window size for LLMs offer an alternative approach, raising the question of whether RAG remains necessary for effectively handling external knowledge. Several existing studies provide inconclusive comparisons between RAG and long-context (LC) LLMs, largely due to limitations in the benchmark designs. In this paper, we present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA encompasses 2326 test cases across four practical QA task categories and three types of naturally occurring long texts. Through systematic evaluation of seven open-source and four proprietary LLMs, we find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks. Our findings provide actionable guidelines for practitioners to effectively leverage both RAG and LC approaches in developing and deploying LLM applications. Our code and dataset is provided at: \href{https://github.com/Alibaba-NLP/LaRA}{\textbf{https://github.com/Alibaba-NLP/LaRA}}.
中文: LaRA基准测试表明,检索增强生成(RAG)与长上下文大模型的选择取决于模型规模、任务类型等多重因素,为有效整合外部知识提供了实用指导。
English: The LaRA benchmark reveals that the choice between Retrieval-Augmented Generation (RAG) and long-context LLMs depends on multiple factors like model size and task type, offering practical guidance for effectively integrating external knowledge into LLMs.

Authors:Ishika Agarwal, Dilek Hakkani-Tür
Title: Neural Networks for Learnable and Scalable Influence Estimation of Instruction Fine-Tuning Data
Abstract:
Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. Particularly, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks -- which we refer to as the InfluenceNetwork -- to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm of estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show no compromise in performance, despite large speedups, between NN-CIFT and the original influence functions. We provide an in-depth hyperparameter analyses of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.
中文: 本文提出InfluenceNetwork方法,通过小型神经网络高效估算数据影响力值,成本降低高达99%,且性能与传统影响力函数相当。
English: This paper introduces InfluenceNetwork, a method using small neural networks to efficiently estimate data influence values with up to 99% cost reduction while maintaining performance comparable to traditional influence functions.
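
The InfluenceNetwork itself can be as small as a two-layer MLP over example embeddings, trained to regress influence values that were computed exactly on a small seed set; the architecture and feature choice below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class InfluenceNetwork(nn.Module):
    """Tiny MLP regressing an influence score from the embeddings of a
    (train example, validation example) pair -- a sketch of the idea."""
    def __init__(self, dim: int = 768, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, train_emb: torch.Tensor, val_emb: torch.Tensor):
        return self.net(torch.cat([train_emb, val_emb], dim=-1)).squeeze(-1)

# Train on a small set of exactly computed influence values, then estimate
# influence for the rest of the dataset at negligible cost.
```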

Authors:Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Vatsal Sharan
Title: FoNE: Precise Single-Token Number Embeddings via Fourier Features
Abstract:
Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model's performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64$\times$ less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3$\times$ and 6$\times$ fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at https://fouriernumber.github.io/.
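
The Fourier encoding admits a compact sketch: for each digit position k, a (cos, sin) pair with period 10^k encodes the number's residue modulo 10^k, so every digit is recoverable from phase. This is a simplified reading of the method, not the released implementation.

```python
import math

def fone_embedding(x: float, num_digits: int = 6) -> list[float]:
    """Fourier Number Embedding sketch: two dimensions (cos, sin) per
    digit position; the pair with period 10**k encodes x mod 10**k."""
    emb = []
    for k in range(1, num_digits + 1):
        period = 10 ** k
        phase = 2 * math.pi * (x % period) / period
        emb.extend([math.cos(phase), math.sin(phase)])
    return emb
```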

Authors:Qingsong Zou, Jingyu Xiao, Qing Li, Zhi Yan, Yuhang Wang, Li Xu, Wenxuan Wang, Kuofeng Gao, Ruoyu Li, Yong Jiang
Title: QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language
Abstract:
Recent advances in large language models (LLMs) have demonstrated remarkable potential in the field of natural language processing. Unfortunately, LLMs face significant security and ethical risks. Although techniques such as safety alignment are developed for defense, prior research reveals the possibility of bypassing such defenses through well-designed jailbreak attacks. In this paper, we propose QueryAttack, a novel framework to examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into structured non-natural query language to bypass the safety alignment mechanisms of LLMs. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack not only can achieve high attack success rates (ASRs), but also can jailbreak various defense methods. Furthermore, we tailor a defense method against QueryAttack, which can reduce ASR by up to $64\%$ on GPT-4-1106. Our code is available at https://github.com/horizonsinzqs/QueryAttack.
中文: 本文提出QueryAttack框架,通过将恶意自然语言查询转换为结构化非自然查询来绕过大语言模型的安全对齐机制,不仅能实现高攻击成功率,还设计了可将GPT-4-1106攻击成功率降低64%的防御方案。
English: This paper introduces QueryAttack, a framework that bypasses safety alignment in large language models by converting malicious natural language queries into structured non-natural queries, achieving high attack success rates while also proposing a defense method that reduces attack effectiveness by up to 64% on GPT-4-1106.

Authors:Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Haining Yu, Xiaohua Jia
Title: The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions
Abstract:
Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.
中文摘要:大语言模型的安全对齐行为由激活空间中的多维方向共同控制,次要方向在塑造拒绝行为和揭示通过令牌操作绕过安全能力的脆弱性方面发挥关键作用。
English Summary: Large Language Models' safety behaviors are governed by multi-dimensional activation directions rather than a single vector, with secondary directions playing crucial roles in shaping refusal responses and revealing vulnerabilities through token manipulation.
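
Conceptually, the directions can be recovered by factorizing the matrix of representation shifts induced by safety fine-tuning, as in this hypothetical sketch; the paper's procedure for isolating and interpreting each direction is richer than a plain SVD.

```python
import torch

def safety_directions(h_before: torch.Tensor, h_after: torch.Tensor, k: int = 8):
    """Sketch: hidden states before/after safety fine-tuning on the same
    prompts, shapes (n_prompts, hidden_dim); top right-singular vectors
    of the centered shift matrix serve as candidate safety directions."""
    shifts = h_after - h_before
    _, S, Vh = torch.linalg.svd(shifts - shifts.mean(0), full_matrices=False)
    return Vh[:k], S[:k]  # orthogonal directions and their strengths
```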

Authors:Chengqian Gao, Haonan Li, Liu Liu, Zeke Xie, Peilin Zhao, Zhiqiang Xu
Title: Principled Data Selection for Alignment: The Hidden Risks of Difficult Examples
Abstract:
The alignment of large language models (LLMs) often assumes that using more clean data yields better outcomes, overlooking the match between model capacity and example difficulty. Challenging this, we propose a new principle: Preference data vary in difficulty, and overly difficult examples hinder alignment, by exceeding the model's capacity. Through systematic experimentation, we validate this principle with three key findings: (1) preference examples vary in difficulty, as evidenced by consistent learning orders across alignment runs; (2) overly difficult examples significantly degrade performance across four LLMs and two datasets; and (3) the capacity of a model dictates its threshold for handling difficult examples, underscoring a critical relationship between data selection and model capacity. Building on this principle, we introduce Selective DPO, which filters out overly difficult examples. This simple adjustment improves alignment performance by 9-16% in win rates on the AlpacaEval 2 benchmark compared to the DPO baseline, surpassing a series of DPO variants with different algorithmic adjustments. Together, these results illuminate the importance of aligning data difficulty with model capacity, offering a transformative perspective for improving alignment strategies in LLMs. Code is available at https://github.com/glorgao/SelectiveDPO.
中文: 本研究挑战了传统上依赖更多数据对齐大语言模型的做法,通过证明过于困难的示例会损害性能,并提出了选择性DPO方法,通过过滤这些示例将对齐效果在胜率上提升了9-16%。
English: The study challenges the conventional approach of using more data for aligning large language models by demonstrating that overly difficult examples hinder performance and introduces Selective DPO, a method that filters such examples to improve alignment outcomes by 9-16% in win rates.
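
The filtering step reduces to scoring preference pairs by a difficulty proxy and dropping the hardest tail, as sketched below; using the reference model's loss on the chosen response as the proxy is an assumption for illustration, not necessarily the paper's exact criterion.

```python
def select_by_difficulty(examples, reference_model, keep_fraction: float = 0.8):
    """Selective-DPO-style filtering (sketch): rank preference pairs by a
    difficulty proxy and drop the hardest examples, which likely exceed
    the model's capacity."""
    scored = sorted(examples, key=lambda ex: reference_model.loss(ex.chosen))
    return scored[: int(len(scored) * keep_fraction)]
```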

Authors:Sougata Saha, Saurabh Kumar Pandey, Harshit Gupta, Monojit Choudhury
Title: Reading between the Lines: Can LLMs Identify Cross-Cultural Communication Gaps?
Abstract:
In a rapidly globalizing and digital world, content such as book and product reviews created by people from diverse cultures is read and consumed by others from different corners of the world. In this paper, we investigate the extent and patterns of gaps in understandability of book reviews due to the presence of culturally-specific items and elements that might be alien to users from another culture. Our user study on 57 book reviews from Goodreads reveals that 83\% of the reviews had at least one culture-specific difficult-to-understand element. We also evaluate the efficacy of GPT-4o in identifying such items, given the cultural background of the reader; the results are mixed, implying a significant scope for improvement. Our datasets are available here: https://github.com/sougata-ub/reading_between_lines
中文: 本研究探讨了Goodreads书评中的文化特定元素如何造成国际读者的理解障碍,发现83%的书评存在此类文化隔阂,同时评估了GPT-4o在识别这些文化参照物方面效果有限。
English: This study examines how culturally-specific elements in book reviews from Goodreads create comprehension gaps for international readers, finding that 83% of reviews contain such barriers, and evaluates GPT-4o's limited effectiveness in identifying these cultural references.

Authors:Yiwen Tang, Zoey Guo, Zhuhao Wang, Ray Zhang, Qizhi Chen, Junli Liu, Delin Qu, Zhigang Wang, Dong Wang, Xuelong Li, Bin Zhao
Title: Exploring the Potential of Encoder-free Architectures in 3D LMMs
Abstract:
Encoder-free architectures have been preliminarily explored in the 2D visual domain, yet it remains an open question whether they can be effectively applied to 3D understanding scenarios. In this paper, we present the first comprehensive investigation into the potential of encoder-free architectures to alleviate the challenges of encoder-based 3D Large Multimodal Models (LMMs). These challenges include the failure to adapt to varying point cloud resolutions and the point features from the encoder not meeting the semantic needs of Large Language Models (LLMs). We identify key aspects for 3D LMMs to remove the encoder and enable the LLM to assume the role of the 3D encoder: 1) We propose the LLM-embedded Semantic Encoding strategy in the pre-training stage, exploring the effects of various point cloud self-supervised losses. And we present the Hybrid Semantic Loss to extract high-level semantics. 2) We introduce the Hierarchical Geometry Aggregation strategy in the instruction tuning stage. This incorporates inductive bias into the LLM layers to focus on the local details of the point clouds. To this end, we present the first Encoder-free 3D LMM, ENEL. Our 7B model rivals the current state-of-the-art model, ShapeLLM-13B, achieving 55.10%, 50.98%, and 43.10% on the classification, captioning, and VQA tasks, respectively. Our results demonstrate that the encoder-free architecture is highly promising for replacing encoder-based architectures in the field of 3D understanding. The code is released at https://github.com/Ivan-Tang-3D/ENEL
中文: 本文提出首个无编码器的3D大模型ENEL,通过预训练阶段的语义编码策略和指令调优阶段的层次几何聚合,使大语言模型直接处理3D点云,在分类、描述和视觉问答任务上达到与更大编码器模型相当的性能。
English: This paper introduces ENEL, the first encoder-free 3D Large Multimodal Model that eliminates traditional 3D encoders by embedding semantic encoding during pre-training and hierarchical geometry aggregation during fine-tuning, achieving performance comparable to larger encoder-based models across classification, captioning, and VQA tasks.

Authors:Yung-Sung Chuang, Benjamin Cohen-Wang, Shannon Zejiang Shen, Zhaofeng Wu, Hu Xu, Xi Victoria Lin, James Glass, Shang-Wen Li, Wen-tau Yih
Title: SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Abstract:
We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. The source code is available at https://github.com/facebookresearch/SelfCite
中文: SelfCite是一种自监督方法,通过上下文消融生成奖励信号,引导LLM在推理时优化采样和微调,显著提升句子级引用的质量,在基准测试中F1分数最高提升5.3分。
English: SelfCite is a self-supervised method that enhances LLMs' sentence-level citation accuracy by using context ablation to generate rewards, improving citation quality through sampling and fine-tuning, achieving up to a 5.3-point F1 score increase on benchmarks.
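
The context-ablation reward combines a necessity test (removing the cited span should break the response) with a sufficiency test (the span alone should preserve it). A sketch under the assumption of an `lm.logprob(response, context)` helper returning log p(response | context):

```python
def selfcite_reward(lm, context: str, cited: str, response: str) -> float:
    """Context-ablation reward (sketch): combine necessity (log-prob drop
    when the cited span is removed) and sufficiency (log-prob retained
    when only the cited span is kept)."""
    full = lm.logprob(response, context)
    without = lm.logprob(response, context.replace(cited, ""))
    only = lm.logprob(response, cited)
    necessity = full - without    # removing the citation should hurt
    sufficiency = only - full     # the citation alone should suffice
    return necessity + sufficiency
```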

Authors:Xinyin Ma, Guangnian Wan, Runpeng Yu, Gongfan Fang, Xinchao Wang
Title: CoT-Valve: Length-Compressible Chain-of-Thought Tuning
Abstract:
Chain-of-Thought significantly enhances a model's reasoning capability, but it also comes with a considerable increase in inference costs due to long chains. With the observation that the reasoning path can be easily compressed for easy tasks but struggles to be compressed for hard tasks, we explore the feasibility of elastically controlling the length of reasoning paths with only one model, thereby reducing the inference overhead of reasoning models dynamically based on task difficulty. We introduce a new tuning and inference strategy named CoT-Valve, designed to allow models to generate reasoning chains of varying lengths. To achieve this, we propose to identify a direction in the parameter space that, when manipulated, can effectively control the length of generated CoT. Moreover, we show that this property is valuable for compressing the reasoning chain. We construct datasets with chains from long to short for the same questions and explore two enhanced strategies for CoT-Valve: (1) a precise length-compressible CoT tuning method, and (2) a progressive chain length compression approach. Our experiments show that CoT-Valve successfully enables controllability and compressibility of the chain and shows better performance than the prompt-based control. We applied this method to QwQ-32B-Preview, reducing reasoning chains on GSM8K from 741 to 225 tokens with a minor performance drop (95.07% to 94.92%) and on AIME from 6827 to 4629 tokens, with only one additional incorrect answer.
中文: CoT-Valve实现了对模型推理链长度的动态控制,有效降低推理开销,在复杂任务上以极小的性能损失显著压缩了推理链长度。
English: CoT-Valve enables dynamic control of reasoning chain lengths in models to reduce inference costs, achieving significant token compression with minimal performance loss on complex tasks.
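
Mechanically, controlling chain length amounts to moving the weights along one direction in parameter space by a scalar amount, as in the sketch below; obtaining `delta_state` (e.g., from tuning on shorter chains) and the sign convention for `alpha` are assumptions.

```python
def apply_cot_valve(base_state: dict, delta_state: dict, alpha: float) -> dict:
    """CoT-Valve-style control (sketch): shift the weights along a single
    length-controlling direction scaled by alpha; in this convention,
    larger alpha yields shorter reasoning chains."""
    return {k: base_state[k] + alpha * delta_state[k] for k in base_state}
```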

Authors:Siyan Zhao, Mingyi Hong, Yang Liu, Devamanyu Hazarika, Kaixiang Lin
Title: Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs
Abstract:
Large Language Models (LLMs) are increasingly used as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce PrefEval, a benchmark for evaluating LLMs' ability to infer, memorize and adhere to user preferences in a long-context conversational setting. PrefEval comprises 3,000 manually curated user preference and query pairs spanning 20 topics. PrefEval contains user personalization or preference information in both explicit and implicit forms, and evaluates LLM performance using a generation and a classification task. With PrefEval, we evaluated the aforementioned preference following capabilities of 10 open-source and proprietary LLMs in multi-session conversations with varying context lengths up to 100k tokens. We benchmark with various prompting, iterative feedback, and retrieval-augmented generation methods. Our benchmarking effort reveals that state-of-the-art LLMs face significant challenges in proactively following users' preferences during conversations. In particular, in zero-shot settings, preference following accuracy falls below 10% at merely 10 turns (~3k tokens) across most evaluated models. Even with advanced prompting and retrieval methods, preference following still deteriorates in long-context conversations. Furthermore, we show that fine-tuning on PrefEval significantly improves performance. We believe PrefEval serves as a valuable resource for measuring, understanding, and enhancing LLMs' preference following abilities, paving the way for personalized conversational agents. Our code and dataset are available at https://prefeval.github.io/.

Authors:Dexian Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Shi Feng, Yifei Zhang, Soujanya Poria
Title: Pixel-Level Reasoning Segmentation via Multi-turn Conversations
Abstract:
Existing visual perception systems focus on region-level segmentation in single-turn dialogues, relying on complex and explicit query instructions. Such systems cannot reason at the pixel level and comprehend dynamic user intent that changes over interaction. Our work tackles this issue by introducing a novel task, Pixel-level Reasoning Segmentation (Pixel-level RS) based on multi-turn conversations, tracking evolving user intent via multi-turn interactions for fine-grained segmentation. To establish a benchmark for this novel task, we build a Pixel-level ReasonIng Segmentation Dataset Based on Multi-Turn Conversations (PRIST), comprising 24k utterances from 8.3k multi-turn conversational scenarios with segmentation targets. Building on PRIST, we further propose MIRAS, a Multi-turn Interactive ReAsoning Segmentation framework that integrates pixel-level segmentation with robust multi-turn conversation understanding, generating pixel-grounded explanations aligned with user intent. The PRIST dataset and MIRAS framework fill the gap in pixel-level reasoning segmentation. Experimental results on the PRIST dataset demonstrate that our method outperforms current segmentation-specific baselines in terms of segmentation and LLM-based reasoning metrics. The code and data are available at: https://github.com/ccccai239/PixelRIST.
中文: 本研究提出了基于多轮对话的像素级推理分割新任务,通过构建PRIST数据集和MIRAS框架,实现了对动态用户意图的追踪与精细分割,在分割效果和推理指标上均优于现有基准方法。
English: This work introduces Pixel-level Reasoning Segmentation (Pixel-level RS), a novel task that tracks evolving user intent through multi-turn conversations for fine-grained segmentation, supported by the PRIST dataset and MIRAS framework, which outperform existing methods in segmentation and reasoning metrics.

Authors:Daniel Fleischer, Moshe Berchansky, Gad Markovits, Moshe Wasserblat
Title: SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models
Abstract:
In the rapidly evolving field of Natural Language Processing, Large Language Models (LLMs) are tasked with increasingly complex reasoning challenges. Traditional methods like chain-of-thought prompting have shown promise but often fall short in fully leveraging a model's reasoning capabilities. This paper introduces SQuARE (Sequential Question Answering Reasoning Engine), a novel prompting technique designed to improve reasoning through a self-interrogation paradigm. Building upon CoT frameworks, SQuARE prompts models to generate and resolve multiple auxiliary questions before tackling the main query, promoting a more thorough exploration of various aspects of a topic. Our expansive evaluations, conducted with Llama 3 and GPT-4o models across multiple question-answering datasets, demonstrate that SQuARE significantly surpasses traditional CoT prompts and existing rephrase-and-respond methods. By systematically decomposing queries, SQuARE advances LLM capabilities in reasoning tasks. The code is publicly available at https://github.com/IntelLabs/RAG-FiT/tree/square.
Chinese: 本文提出SQuARE这一新颖的自询问提示技术,通过让模型在回答主问题前生成并解决多个辅助问题来增强推理能力,在Llama 3和GPT-4o的评估中显著超越了传统思维链等方法的性能表现。
English: This paper introduces SQuARE, a novel self-interrogation prompting technique that enhances LLM reasoning by generating and resolving auxiliary questions before addressing main queries, significantly outperforming traditional methods like chain-of-thought in evaluations with Llama 3 and GPT-4o.
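
The self-interrogation loop is simple to express: elicit auxiliary sub-questions, answer them, then answer the main query using the notes. The prompt wording and the `llm` interface below are illustrative assumptions.

```python
def square_answer(llm, question: str, n_aux: int = 3) -> str:
    """SQuARE-style self-interrogation (sketch): generate and resolve
    auxiliary questions before tackling the main query."""
    aux = llm(f"List {n_aux} sub-questions that would help answer:\n{question}")
    qa_notes = llm(f"Answer each of these briefly:\n{aux}")
    return llm(f"Using these notes:\n{qa_notes}\nNow answer: {question}")
```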

Authors:Razvan-Gabriel Dumitru, Minglai Yang, Vikas Yadav, Mihai Surdeanu
Title: CopySpec: Accelerating LLMs with Speculative Copy-and-Paste Without Compromising Quality
Abstract:
We introduce CopySpec, a simple yet effective technique to tackle the inefficiencies LLMs face when generating responses that closely resemble previous outputs or responses that can be verbatim extracted from context. CopySpec identifies repeated sequences in the model's chat history or context and speculates that the same tokens will follow, enabling seamless copying without compromising output quality and without requiring additional GPU memory. To evaluate the effectiveness of our approach, we conducted experiments using seven LLMs and five datasets: MT-Bench, CNN/DM, GSM8K, HumanEval, and our newly created dataset, MT-Redundant. MT-Redundant, introduced in this paper, transforms the second turn of MT-Bench into a request for variations of the first turn's answer, simulating real-world scenarios where users request modifications to prior responses. Our results demonstrate significant speed-ups: up to 2.35x on CNN/DM, 3.08x on the second turn of select MT-Redundant categories, and 2.66x on the third turn of GSM8K's self-correction tasks. Importantly, we show that CopySpec integrates seamlessly with speculative decoding, yielding an average 49% additional speed-up over speculative decoding for the second turn of MT-Redundant across all eight categories. While LLMs, even with speculative decoding, suffer from slower inference as context size grows, CopySpec leverages larger contexts to accelerate inference, making it a faster complementary solution. Our code and dataset are publicly available at https://github.com/RazvanDu/CopySpec.
中文: CopySpec 是一种通过识别并复用对话历史中的重复序列来加速大语言模型推理的技术,无需额外显存即可实现显著加速且不损失输出质量。
English: CopySpec is a technique that accelerates LLM inference by identifying and reusing repeated sequences from chat history or context, achieving significant speed-ups without extra GPU memory or quality loss.
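
The speculation step is essentially suffix matching against the context: if the last few generated tokens occurred earlier, propose the tokens that followed them as a draft for the target model to verify. A sketch, with `gamma` and `max_copy` as assumed hyperparameter names:

```python
def copyspec_propose(tokens: list[int], gamma: int = 8, max_copy: int = 16) -> list[int]:
    """Copy-and-paste speculation (sketch): find the most recent earlier
    occurrence of the current gamma-token suffix and propose the tokens
    that followed it as draft tokens for verification."""
    if len(tokens) <= gamma:
        return []
    suffix = tokens[-gamma:]
    for start in range(len(tokens) - gamma - 1, -1, -1):  # most recent match first
        if tokens[start:start + gamma] == suffix:
            return tokens[start + gamma:start + gamma + max_copy]
    return []
```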

Authors:Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari
Title: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation
Abstract:
Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches, while also exploring the diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. All resources are publicly available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
中文: 大型语言模型存在幻觉和知识过时问题,检索增强生成通过整合外部动态信息来缓解,而多模态检索增强生成则融合文本、图像等多类数据以提升效果,但在跨模态对齐和推理方面带来了新挑战。
English: Large Language Models face issues like hallucinations and outdated knowledge, which Retrieval-Augmented Generation addresses by incorporating external dynamic information, and Multimodal RAG further enhances this by integrating multiple data types while presenting unique challenges in cross-modal alignment and reasoning.

Authors:Zhining Liu, Rana Ali Amjad, Ravinarayana Adkathimar, Tianxin Wei, Hanghang Tong
Title: SelfElicit: Your Language Model Secretly Knows Where is the Relevant Evidence
Abstract:
Providing Language Models (LMs) with relevant evidence in the context (either via retrieval or user-provided) can significantly improve their ability to provide better-grounded responses. However, recent studies have found that LMs often struggle to fully comprehend and utilize key evidence from the context, especially when it contains noise and irrelevant information, an issue common in real-world scenarios. To address this, we propose SelfElicit, an inference-time approach that helps LMs focus on key contextual evidence through self-guided explicit highlighting. By leveraging the inherent evidence-finding capabilities of LMs using the attention scores of deeper layers, our method automatically identifies and emphasizes key evidence within the input context, facilitating more accurate and grounded responses without additional training or iterative prompting. We demonstrate that SelfElicit brings consistent and significant improvement on multiple evidence-based QA tasks for various LM families while maintaining computational efficiency. Our code and documentation are available at https://github.com/ZhiningLiu1998/SelfElicit.
Chinese: SelfElicit是一种推理时方法,通过利用深层注意力分数自动识别并突出上下文中的关键证据,帮助语言模型更好地利用上下文信息,从而在基于证据的任务上实现显著提升且无需额外训练。
English: SelfElicit is an inference-time method that enhances Language Models' ability to utilize key contextual evidence by automatically highlighting crucial information using deeper layer attention scores, leading to improved performance on evidence-based tasks without requiring additional training.
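
To make the highlighting step concrete, here is a toy numpy sketch of attention-based evidence selection, assuming a head-averaged attention matrix from one deep layer and precomputed sentence spans; the real SelfElicit uses the model's own attention scores and a more careful scoring rule.
```python
import numpy as np

# Toy sketch in the spirit of SelfElicit (not the authors' implementation).
# `attn[i, j]` is assumed to be the head-averaged attention token i pays to
# token j at one deep layer; `spans` are (start, end) sentence boundaries.

def highlight_evidence(attn: np.ndarray, spans: list[tuple[int, int]],
                       tokens: list[str], alpha: float = 0.9) -> str:
    recv = attn.mean(axis=0)                       # attention each token receives
    scores = [recv[s:e].mean() for s, e in spans]  # one score per sentence
    thresh = alpha * max(scores)
    out = []
    for (s, e), sc in zip(spans, scores):
        sent = " ".join(tokens[s:e])
        # Emphasize high-scoring sentences explicitly in the rebuilt context.
        out.append(f"<evidence>{sent}</evidence>" if sc >= thresh else sent)
    return " ".join(out)
```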

Authors:Zhihan Zhang, Shiyang Li, Zixuan Zhang, Xin Liu, Haoming Jiang, Xianfeng Tang, Yifan Gao, Zheng Li, Haodong Wang, Zhaoxuan Tan, Yichuan Li, Qingyu Yin, Bing Yin, Meng Jiang
Title: IHEval: Evaluating Language Models on Following the Instruction Hierarchy
Abstract:
The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models' ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions in different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
中文摘要:本研究提出了IHEval基准来评估语言模型对指令层级的遵循能力,发现模型在处理冲突指令时表现显著下降,凸显了未来发展中针对性优化的必要性。
English Summary: The study introduces IHEval, a benchmark to evaluate language models' adherence to the instruction hierarchy, revealing their significant difficulty in prioritizing conflicting instructions and highlighting the need for targeted improvements.

Authors:Areeg Fahad Rasheed, M. Zarkoosh, Shimam Amer Chasib, Safa F. Abbas
Title: Data Augmentation to Improve Large Language Models in Food Hazard and Product Detection
Abstract:
The primary objective of this study is to demonstrate the impact of data augmentation using ChatGPT-4o-mini on food hazard and product analysis. The augmented data is generated using ChatGPT-4o-mini and subsequently used to train two large language models: RoBERTa-base and Flan-T5-base. The models are evaluated on test sets. The results indicate that using augmented data helped improve model performance across key metrics, including recall, F1 score, precision, and accuracy, compared to using only the provided dataset. The full code, including model training and the augmented dataset, can be found in this repository: https://github.com/AREEG94FAHAD/food-hazard-prdouct-cls
Chinese: 本研究显示,利用ChatGPT-4o-mini进行数据增强显著提升了RoBERTa-base和Flan-T5-base模型在食品危害与产品分析中的表现,有效改进了召回率、F1分数、精确度和准确率。
English: This study demonstrates that data augmentation with ChatGPT-4o-mini significantly enhances the performance of RoBERTa-base and Flan-T5-base models in food hazard and product analysis, improving recall, F1 score, precision, and accuracy.
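
A minimal sketch of the augmentation step, assuming the OpenAI chat-completions client and an illustrative prompt (the paper's exact prompts are in its repository):
```python
from openai import OpenAI

# Hedged sketch of paraphrase-based augmentation; the prompt wording is an
# assumption, not taken from the paper's code.
client = OpenAI()

def paraphrase(text: str, n: int = 3) -> list[str]:
    """Ask a chat model for n label-preserving paraphrases of one example."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rewrite the following food-incident report {n} times, "
                       f"one paraphrase per line, preserving all facts:\n{text}",
        }],
    )
    return resp.choices[0].message.content.splitlines()

# Each paraphrase keeps the original label, growing the training set before
# fine-tuning RoBERTa-base or Flan-T5-base.
```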

Authors:Miranda Muqing Miao, Michael Kearns
Title: Hallucination, Monofacts, and Miscalibration: An Empirical Investigation
Abstract:
Hallucinated facts in large language models (LLMs) have recently been shown to obey a statistical lower bound determined by the monofact rate (related to the classical Good-Turing missing mass estimator) minus model miscalibration (Kalai & Vempala, 2024). We present the first empirical investigation of this three-way relationship in classical n-gram models and fine-tuned encoder-decoder Transformers. By generating training data from Pareto distributions with varying shape parameters, we systematically control the monofact rates and establish its positive relationship with hallucination. To bridge theory and practice, we derive an empirical analog of the hallucination bound by replacing the population miscalibration term (Section 2.1) with an empirical bin-wise KL divergence and confirm its practical viability. We then introduce selective upweighting -- a simple yet effective technique that strategically repeats as little as 5% of training examples -- to deliberately inject miscalibration into the model. This intervention reduces hallucination by up to 40%, challenging universal deduplication policies. Our experiments reveal a critical trade-off: selective upweighting maintains pre-injection levels of accuracy while substantially reducing hallucination, whereas standard training gradually improves accuracy but fails to address persistently high hallucination, indicating an inherent tension in optimization objectives.
中文: 研究表明,语言模型中的幻觉与单事实率呈正相关,通过选择性加权引入可控的校准偏差,可在保持准确性的同时将幻觉降低达40%,揭示了与标准训练方法之间的优化目标冲突。
English: The study demonstrates that hallucination in language models is positively related to monofact rates and can be reduced by up to 40% through selective upweighting, which introduces controlled miscalibration while maintaining accuracy, revealing a trade-off with standard training methods.
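
The two central quantities are simple to compute. Here is a sketch under the simplifying assumption that a "fact" is just a training example; the paper's actual setup uses n-gram models and Pareto-distributed data.
```python
from collections import Counter

def monofact_rate(examples: list[str]) -> float:
    """Fraction of observations whose fact appears exactly once
    (Good-Turing missing-mass style estimate)."""
    counts = Counter(examples)
    return sum(1 for c in counts.values() if c == 1) / len(examples)

def selectively_upweight(examples: list[str], frac: float = 0.05) -> list[str]:
    """Deliberately repeat a small fraction of examples to inject
    miscalibration, as in the paper's intervention."""
    k = max(1, int(frac * len(examples)))
    return examples + examples[:k]  # naive choice of which examples to repeat
```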

Authors:Soyoung Yoon, Dongha Ahn, Youngwon Lee, Minkyu Jung, HyungJoo Jang, Seung-won Hwang
Title: RoToR: Towards More Reliable Responses for Order-Invariant Inputs
Abstract:
Mitigating positional bias of language models (LMs) for listwise inputs is a well-known and important problem (e.g., lost-in-the-middle). While zero-shot order-invariant LMs have been proposed to solve this issue, their success on practical listwise problems has been limited. In this work, as a first contribution, we identify and overcome two limitations to make zero-shot invariant LMs more practical: (1) training and inference distribution mismatch arising from modifying positional ID assignments to enforce invariance, and (2) failure to adapt to mixture of order-invariant and sensitive inputs in practical listwise problems. Then, to overcome these issues we propose (1) RoToR, a zero-shot invariant LM for genuinely order-invariant inputs with minimal modifications of positional IDs, and (2) Selective Routing, an adaptive framework that handles both order-invariant and order-sensitive inputs in listwise tasks. On the Lost in the middle (LitM), Knowledge Graph QA (KGQA), and MMLU benchmarks, we show that RoToR with Selective Routing can effectively handle practical listwise input tasks in a zero-shot manner (https://github.com/soyoung97/RoToR).
中文摘要:本研究通过提出最小化位置ID修改的RoToR模型和能同时处理顺序不变与顺序敏感输入的Selective Routing框架,解决了零样本顺序不变语言模型的局限性,并在多个基准测试中验证了其有效性。
English Summary: This study addresses limitations in zero-shot order-invariant language models by proposing RoToR with minimal positional ID modifications and Selective Routing to handle both order-invariant and order-sensitive inputs, demonstrating effectiveness on multiple benchmarks.
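
As a toy illustration of order invariance via positional IDs (not RoToR's actual scheme, which modifies positional assignments far less invasively), one can give every listwise segment the same positional offsets so that reordering segments cannot change the computation:
```python
import torch

# Illustrative baseline only: restart every segment at position 0, and place
# the query "after" all segments. Reordering the segments leaves the set of
# (token, position) pairs unchanged, hence the encoding is order-invariant.

def shared_position_ids(seg_lens: list[int], query_len: int) -> torch.Tensor:
    width = max(seg_lens)
    ids: list[int] = []
    for n in seg_lens:
        ids.extend(range(n))                          # each segment: 0..n-1
    ids.extend(range(width, width + query_len))       # query follows all segments
    return torch.tensor(ids)

print(shared_position_ids([3, 2, 3], query_len=4))
# tensor([0, 1, 2, 0, 1, 0, 1, 2, 3, 4, 5, 6])
```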

Authors:Huiyao Chen, Meishan Zhang, Jing Li, Min Zhang, Lilja Øvrelid, Jan Hajič, Hao Fei
Title: Semantic Role Labeling: A Systematical Survey
Abstract:
Semantic role labeling (SRL) is a central natural language processing (NLP) task aiming to understand the semantic roles within texts, facilitating a wide range of downstream applications. While SRL has garnered extensive and enduring research, there is currently a lack of a comprehensive survey that thoroughly organizes and synthesizes the field. This paper aims to review the entire research trajectory of the SRL community over the past two decades. We begin by providing a complete definition of SRL. To offer a comprehensive taxonomy, we categorize SRL methodologies into four key perspectives: model architectures, syntax feature modeling, application scenarios, and multi-modal extensions. Further, we discuss SRL benchmarks, evaluation metrics, and paradigm modeling approaches, while also exploring practical applications across various domains. Finally, we analyze future research directions in SRL, addressing the evolving role of SRL in the age of large language models (LLMs) and its potential impact on the broader NLP landscape. We maintain a public repository and consistently update related resources at: https://github.com/DreamH1gh/Awesome-SRL
中文摘要:本文系统梳理了语义角色标注领域二十年的研究进展,涵盖方法分类、评估体系、实际应用,并探讨了该技术在大语言模型时代的发展方向。
English Summary: This paper provides a comprehensive survey of semantic role labeling (SRL) research over the past two decades, covering methodology taxonomy, benchmarks, applications, and future directions in the context of large language models.

Authors:Yuchang Zhu, Huizhe Zhang, Bingzhe Wu, Jintang Li, Zibin Zheng, Peilin Zhao, Liang Chen, Yatao Bian
Title: Measuring Diversity in Synthetic Datasets
Abstract:
Large language models (LLMs) are widely adopted to generate synthetic datasets for various natural language processing (NLP) tasks, such as text classification and summarization. However, accurately measuring the diversity of these synthetic datasets, an aspect crucial for robust model performance, remains a significant challenge. In this paper, we introduce DCScore, a novel method for measuring synthetic dataset diversity from a classification perspective. Specifically, DCScore formulates diversity evaluation as a sample classification task, leveraging mutual relationships among samples. We further provide theoretical verification of the diversity-related axioms satisfied by DCScore, highlighting its role as a principled diversity evaluation method. Experimental results on synthetic datasets reveal that DCScore enjoys a stronger correlation with multiple diversity pseudo-truths of evaluated datasets, underscoring its effectiveness. Moreover, both empirical and theoretical evidence demonstrate that DCScore substantially reduces computational costs compared to existing methods. Code is available at: https://github.com/bluewhalelab/dcscore.
Chinese: 本文提出DCScore这一新方法,通过将多样性评估构建为样本分类任务来衡量大语言模型生成的合成数据集的多样性,实验证明该方法与多样性基准相关性更强,且相比现有方法显著降低了计算成本。
English: This paper introduces DCScore, a novel method for evaluating the diversity of synthetic datasets generated by large language models by framing it as a classification task, which demonstrates stronger correlation with diversity benchmarks and reduces computational costs compared to existing approaches.
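
One minimal reading of "diversity as sample classification" can be sketched as follows: treat each sample as its own class, classify every sample against all samples via a softmax over similarities, and sum the probability each sample assigns to itself. This is an illustrative sketch, not the official DCScore implementation.
```python
import numpy as np

def dcscore(embeddings: np.ndarray, tau: float = 0.1) -> float:
    """Diversity via self-classification: duplicates split their probability
    mass across copies, lowering the score; fully distinct samples score ~n."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T / tau
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(sims)
    P /= P.sum(axis=1, keepdims=True)         # row-wise softmax "classifier"
    return float(np.trace(P))

# n identical samples -> n * (1/n) = 1; n distinct samples -> close to n.
```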

Authors:Qifan Yu, Zhenyu He, Sijie Li, Xun Zhou, Jun Zhang, Jingjing Xu, Di He
Title: Enhancing Auto-regressive Chain-of-Thought through Loop-Aligned Reasoning
Abstract:
Chain-of-Thought (CoT) prompting has emerged as a powerful technique for enhancing language models' reasoning capabilities. However, generating long and correct CoT trajectories is challenging. Recent studies have demonstrated that Looped Transformers possess remarkable length generalization capabilities, but their limited generality and adaptability prevent them from serving as an alternative to auto-regressive solutions. To better leverage the strengths of Looped Transformers, we propose RELAY (REasoning through Loop Alignment iterativelY). Specifically, we align the steps of Chain-of-Thought (CoT) reasoning with loop iterations and apply intermediate supervision during the training of Looped Transformers. This additional iteration-wise supervision not only preserves the Looped Transformer's ability for length generalization but also enables it to predict CoT reasoning steps for unseen data. Therefore, we leverage this Looped Transformer to generate accurate reasoning chains for complex problems that exceed the training length, which will then be used to fine-tune an auto-regressive model. We conduct extensive experiments, and the results demonstrate the effectiveness of our approach, with significant improvements in the performance of the auto-regressive model. Code will be released at https://github.com/qifanyu/RELAY.
Chinese: RELAY通过将思维链推理步骤与循环Transformer的迭代对齐,使其能够为超出训练长度的复杂问题生成准确推理链,进而用于微调自回归模型以提升性能。
English: RELAY aligns CoT reasoning steps with loop iterations in Looped Transformers, enabling them to generate accurate reasoning chains for complex problems beyond training length, which are then used to fine-tune auto-regressive models for improved performance.

Authors:Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou
Title: mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
Abstract:
Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our codes, datasets and models are released in https://github.com/haon-chen/mmE5.
Chinese: 本研究提出了高质量合成多模态数据的三个标准——广泛覆盖、强跨模态对齐和高保真度,并据此生成数据集训练mmE5模型,在多个基准测试中取得了最优性能。
English: This study establishes three criteria for high-quality synthetic multimodal data—broad scope, robust cross-modal alignment, and high fidelity—and uses them to create datasets that train the mmE5 model, achieving state-of-the-art performance on benchmarks.

Authors:Prajwal Gatti, Kshitij Parikh, Dhriti Prasanna Paul, Manish Gupta, Anand Mishra
Title: Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
Abstract:
Non-native speakers with limited vocabulary often struggle to name specific objects despite being able to visualize them, e.g., people outside Australia searching for numbats. Further, users may want to search for such elusive objects with difficult-to-sketch interactions, e.g., numbat digging in the ground. In such common but complex situations, users desire a search interface that accepts composite multimodal queries comprising hand-drawn sketches of difficult-to-name but easy-to-draw objects and text describing difficult-to-sketch but easy-to-verbalize object attributes or interaction with the scene. This novel problem statement distinctly differs from the previously well-researched TBIR (text-based image retrieval) and SBIR (sketch-based image retrieval) problems. To study this under-explored task, we curate a dataset, CSTBIR (Composite Sketch+Text Based Image Retrieval), consisting of approx. 2M queries and 108K natural scene images. Further, as a solution to this problem, we propose a pretrained multimodal transformer-based baseline, STNET (Sketch+Text Network), that uses a hand-drawn sketch to localize relevant objects in the natural scene image, and encodes the text and image to perform image retrieval. In addition to contrastive learning, we propose multiple training objectives that improve the performance of our model. Extensive experiments show that our proposed method outperforms several state-of-the-art retrieval methods for text-only, sketch-only, and composite query modalities. We make the dataset and code available at our project website.
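Chinese: 本文提出了结合手绘草图与文本的复合查询图像检索新任务,构建了包含约200万查询和10.8万张自然场景图像的CSTBIR数据集,并提出基线模型STNET,在文本、草图及复合查询模态下均优于现有最先进的检索方法。
English: This paper introduces composite sketch+text image retrieval for objects with elusive names and complex interactions, curating the CSTBIR dataset of approximately 2M queries and 108K natural scene images and proposing STNET, a multimodal transformer baseline that outperforms state-of-the-art methods across text-only, sketch-only, and composite query modalities.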

Authors:Zhiming Ma, Xiayang Xiao, Sihao Dong, Peidong Wang, HaiPeng Wang, Qingyun Pan
Title: SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation
Abstract:
As a powerful all-weather Earth observation tool, synthetic aperture radar (SAR) remote sensing enables critical military reconnaissance, maritime surveillance, and infrastructure monitoring. Although vision language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. This dataset not only supports key tasks such as visual understanding and object detection, but also has a unique innovative aspect: this study develops a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation and providing a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Experiments on 16 mainstream VLMs fully verify the effectiveness of the dataset. The project will be released at https://github.com/JimmyMa99/SARChat.
中文摘要:本文创新性地提出了首个面向SAR图像的大规模多模态对话数据集SARChat-2M,包含约200万高质量图文对,为评估和提升视觉语言模型在专业遥感领域的解析能力提供了基准框架。
English Summary: This paper introduces SARChat-2M, the first large-scale multimodal dialogue dataset for SAR images containing 2 million image-text pairs, which enables and evaluates vision-language models' capabilities in SAR interpretation across diverse scenarios.

Authors:Víctor Gallego
Title: MetaSC: Test-Time Safety Specification Optimization for Language Models
Abstract:
We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts, termed specifications, to drive the critique and revision process adaptively. This test-time optimization improves performance not only against adversarial jailbreak requests but also on diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code is released at https://github.com/vicgalle/meta-self-critique.git.
中文: 本文提出了一种新颖的动态安全框架,通过元批判机制在推理时迭代优化安全提示,无需修改模型权重即可显著提升语言模型对抗恶意攻击和通用安全任务的表现。
English: This paper introduces a dynamic safety framework that enhances language model safety through a meta-critique mechanism, which iteratively refines safety prompts during inference to improve performance against adversarial attacks and general safety tasks without altering model weights.
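
The meta-critique loop can be sketched in a few lines; `llm` below is a placeholder for any chat-completion call, and the prompts are illustrative rather than the paper's exact specifications.
```python
# Sketch of a test-time meta-critique loop in the spirit of MetaSC.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire up your chat model here")

def metasc_respond(user_request: str, spec: str, rounds: int = 2) -> str:
    answer = llm(f"Specification: {spec}\nRequest: {user_request}\nAnswer:")
    for _ in range(rounds):
        critique = llm(f"Critique this answer against the specification.\n"
                       f"Specification: {spec}\nAnswer: {answer}")
        # Meta-critique: revise the specification itself, not just the answer.
        spec = llm(f"Improve the safety specification so the critique above is "
                   f"addressed.\nCurrent specification: {spec}\nCritique: {critique}")
        answer = llm(f"Specification: {spec}\nRequest: {user_request}\n"
                     f"Revise your previous answer: {answer}")
    return answer
```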

Authors:Zach Nussbaum, Brandon Duderstadt
Title: Training Sparse Mixture Of Experts Text Embedding Models
Abstract:
Transformer-based text embedding models have improved their performance on benchmarks like MIRACL and BEIR by increasing their parameter counts. However, this scaling approach introduces significant deployment challenges, including increased inference latency and memory usage. These challenges are particularly severe in retrieval-augmented generation (RAG) applications, where large models' increased memory requirements constrain dataset ingestion capacity, and their higher latency directly impacts query-time performance. While causal language models have addressed similar efficiency challenges using Mixture of Experts (MoE) architectures, this approach hasn't been successfully adapted to the general text embedding setting. In this paper, we introduce Nomic Embed v2, the first general purpose MoE text embedding model. Our model outperforms models in the same parameter class on both monolingual and multilingual benchmarks while also maintaining competitive performance with models twice its size. We open-source all code, models, and evaluation data to ensure full reproducibility of our training pipeline at https://github.com/nomic-ai/contrastors.
中文: 基于Transformer的文本嵌入模型因规模扩大面临部署难题,而新型Nomic Embed v2作为首个MoE架构模型,性能超越同类并媲美更大模型,同时开源确保可复现性。
English: Transformer-based text embedding models face deployment challenges due to scaling, but the new Nomic Embed v2, the first MoE-based model, outperforms peers and matches larger models while being open-sourced for reproducibility.
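
For readers unfamiliar with MoE layers, here is a generic top-k routed feed-forward block of the kind such models build on, in PyTorch; this is illustrative only, not Nomic Embed v2's architecture or hyperparameters.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Top-k routed mixture-of-experts FFN: each token is processed by only
    k of n experts, so capacity grows without proportional inference cost."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (tokens, n_experts)
        topv, topi = gates.topk(self.k, dim=-1)
        topv = topv / topv.sum(dim=-1, keepdim=True)       # renormalize top-k gates
        out = torch.zeros_like(x)
        for j in range(self.k):                            # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = topi[:, j] == e
                if mask.any():
                    out[mask] += topv[mask, j, None] * expert(x[mask])
        return out
```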

Authors:Shengkun Tang, Oliver Sieberling, Eldar Kurtic, Zhiqiang Shen, Dan Alistarh
Title: DarwinLM: Evolutionary Structured Pruning of Large Language Models
Abstract:
Large Language Models (LLMs) have achieved significant success across various NLP tasks. However, their massive computational costs limit their widespread use, particularly in real-time applications. Structured pruning offers an effective solution by compressing models and directly providing end-to-end speed improvements, regardless of the hardware environment. Meanwhile, different components of the model exhibit varying sensitivities towards pruning, calling for non-uniform model compression. However, a pruning method should not only identify a capable substructure, but also account for post-compression training. To this end, we propose DarwinLM, a method for training-aware structured pruning. DarwinLM builds upon an evolutionary search process, generating multiple offspring models in each generation through mutation, and selecting the fittest for survival. To assess the effect of post-training, we incorporate a lightweight, multistep training process within the offspring population, progressively increasing the number of tokens and eliminating poorly performing models in each selection stage. We validate our method through extensive experiments on Llama-2-7B, Llama-3.1-8B and Qwen-2.5-14B-Instruct, achieving state-of-the-art performance for structured pruning. For instance, DarwinLM surpasses ShearedLlama while requiring 5x less training data during post-compression training. Code is at: https://github.com/IST-DASLab/DarwinLM
Chinese: DarwinLM是一种训练感知的结构化剪枝方法,通过进化搜索和多阶段训练,在降低计算成本的同时高效压缩大语言模型并保持优异性能。
English: DarwinLM is a training-aware structured pruning method that uses evolutionary search and multistep training to efficiently compress large language models while maintaining high performance with reduced computational costs.

Authors:Jiacong Xu, Shao-Yuan Lo, Bardia Safaei, Vishal M. Patel, Isht Dwivedi
Title: Towards Zero-Shot Anomaly Detection and Reasoning with Multimodal Large Language Models
Abstract:
Zero-Shot Anomaly Detection (ZSAD) is an emerging AD paradigm. Unlike the traditional unsupervised AD setting that requires a large number of normal samples to train a model, ZSAD is more practical for handling data-restricted real-world scenarios. Recently, Multimodal Large Language Models (MLLMs) have shown revolutionary reasoning capabilities in various vision tasks. However, the reasoning of image abnormalities remains underexplored due to the lack of corresponding datasets and benchmarks. To facilitate research in AD & reasoning, we establish the first visual instruction tuning dataset, Anomaly-Instruct-125k, and the evaluation benchmark, VisA-D&R. Through investigation with our benchmark, we reveal that current MLLMs like GPT-4o cannot accurately detect and describe fine-grained anomalous details in images. To address this, we propose Anomaly-OneVision (Anomaly-OV), the first specialist visual assistant for ZSAD and reasoning. Inspired by human behavior in visual inspection, Anomaly-OV leverages a Look-Twice Feature Matching (LTFM) mechanism to adaptively select and emphasize abnormal visual tokens. Extensive experiments demonstrate that Anomaly-OV achieves significant improvements over advanced generalist models in both detection and reasoning. Extensions to medical and 3D AD are provided for future study. The link to our project page: https://xujiacong.github.io/Anomaly-OV/
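Chinese: 本文构建了首个面向异常检测与推理的视觉指令微调数据集Anomaly-Instruct-125k和评测基准VisA-D&R,并提出采用"看两次"特征匹配机制(LTFM)的专用视觉助手Anomaly-OV,在零样本异常检测与推理上显著优于通用多模态大模型。
English: This work establishes Anomaly-Instruct-125k, the first visual instruction tuning dataset for anomaly detection and reasoning, along with the VisA-D&R benchmark, and proposes Anomaly-OV, a specialist assistant using a Look-Twice Feature Matching mechanism that significantly outperforms generalist MLLMs in zero-shot anomaly detection and reasoning.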

Authors:Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li
Title: DPO-Shift: Shifting the Distribution of Direct Preference Optimization
Abstract:
Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (or preferred) and rejected (or dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, and this phenomenon is known as likelihood displacement. To tackle this challenge, in this work we introduce DPO-Shift to controllably shift the distribution of the chosen probability. Then, we show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
中文: 本文提出DPO-Shift方法,通过可控地调整优选响应概率分布来解决直接偏好优化中的似然偏移问题,实验证明该方法在保持奖励边界与提升优选概率间存在权衡关系,并在下游任务中优于原始DPO算法。
English: This paper introduces DPO-Shift, a method to address the likelihood displacement issue in Direct Preference Optimization by controllably shifting the chosen response probability, demonstrating its superiority over DPO through improved performance on downstream tasks and a trade-off between chosen probability and reward margin.
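
A sketch of the loss in PyTorch, with a hypothetical shift factor `lam` on the rejected term; consult the paper and repository for the exact parameterization.
```python
import torch
import torch.nn.functional as F

def dpo_shift_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected,
                   beta: float = 0.1, lam: float = 0.9) -> torch.Tensor:
    """Standard DPO loss with an assumed shift factor `lam` in (0, 1] on the
    rejected log-ratio. Inputs are summed log-probabilities of each response
    under the policy and the frozen reference model."""
    chosen_logratio = pol_chosen - ref_chosen
    rejected_logratio = pol_rejected - ref_rejected
    # lam = 1.0 recovers vanilla DPO; lam < 1 down-weights the rejected side,
    # trading reward margin for a higher chosen-response probability.
    logits = beta * (chosen_logratio - lam * rejected_logratio)
    return -F.logsigmoid(logits).mean()
```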

Authors:Cong Lu, Shengran Hu, Jeff Clune
Title: Automated Capability Discovery via Foundation Model Self-Exploration
Abstract:
Foundation models have become general-purpose assistants, exhibiting diverse capabilities across numerous domains through training on web-scale data. It remains challenging to precisely characterize even a fraction of the full spectrum of these abilities and potential risks in any new model. Existing evaluation approaches often require significant human effort, and it is taking increasing effort to design ever harder challenges for more capable models. We introduce Automated Capability Discovery (ACD), a framework that designates one foundation model as a scientist to systematically propose open-ended tasks probing the abilities of a subject model (potentially itself). By combining frontier models with ideas from the field of open-endedness, ACD automatically and systematically uncovers a diverse spectrum of surprising capabilities and failures in the subject model. We demonstrate ACD across a range of foundation models (including the GPT, Claude, and Llama series), showing that it automatically generates thousands of distinct tasks, which are then clustered to reveal dozens of broader capability areas and failure modes, that would be challenging for any single team to uncover. We further validate our method's automated scoring with extensive human surveys, observing high agreement between model-generated and human evaluations. By leveraging foundation models' ability to both create tasks and self-evaluate, ACD is a significant step toward scalable, automated evaluation of novel AI systems. All code and evaluation logs are open-sourced at https://github.com/conglu1997/ACD.
中文摘要:自动化能力发现(ACD)框架将一个基础模型作为科学家,通过开放式探索和自我评估,系统地生成和评估任务,自动揭示目标模型的广泛能力与风险。
English Summary: The Automated Capability Discovery (ACD) framework employs one foundation model as a scientist to systematically generate and evaluate tasks, automatically uncovering a wide range of capabilities and risks in subject models through open-ended exploration and self-assessment.

Authors:Fu-An Chao, Berlin Chen
Title: Towards Efficient and Multifaceted Computer-assisted Pronunciation Training Leveraging Hierarchical Selective State Space Model and Decoupled Cross-entropy Loss
Abstract:
Prior efforts in building computer-assisted pronunciation training (CAPT) systems often treat automatic pronunciation assessment (APA) and mispronunciation detection and diagnosis (MDD) as separate fronts: the former aims to provide multiple pronunciation aspect scores across diverse linguistic levels, while the latter focuses instead on pinpointing the precise phonetic pronunciation errors made by non-native language learners. However, it is generally expected that a full-fledged CAPT system should perform both functionalities simultaneously and efficiently. In response to this surging demand, in this work we first propose HMamba, a novel CAPT approach that seamlessly integrates APA and MDD tasks in parallel. In addition, we introduce a novel loss function, decoupled cross-entropy loss (deXent), specifically tailored for MDD to facilitate better-supervised learning for detecting mispronounced phones, thereby enhancing overall performance. A comprehensive set of empirical results on the speechocean762 benchmark dataset demonstrates the effectiveness of our approach on APA. Notably, our proposed approach also yields a considerable improvement in MDD performance over a strong baseline, achieving an F1-score of 63.85%. Our codes are made available at https://github.com/Fuann/hmamba
中文摘要:本文提出HMamba新型计算机辅助发音训练系统,通过并行整合发音自动评估与误读检测诊断功能,并采用新型解耦交叉熵损失函数,在基准数据集上实现了性能的显著提升。
English Summary: This paper introduces HMamba, a novel CAPT system that integrates automatic pronunciation assessment and mispronunciation detection in parallel, enhanced by a new loss function that significantly improves performance on benchmark datasets.

Authors:Weigao Sun, Disen Lan, Yiran Zhong, Xiaoye Qu, Yu Cheng
Title: LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid
Abstract:
Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. However, existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy, which results in lower computation parallelism and limits their scalability to longer sequences in distributed systems. In this paper, we introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models with very-long input sequences. Compared to previous work LASP, LASP-2 rethinks the minimal communication requirement for SP on linear attention layers and reorganizes the whole communication-computation workflow of LASP. In this way, only one single AllGather collective communication is needed on intermediate memory states, whose sizes are independent of the sequence length, leading to significant improvements in both communication and computation parallelism, as well as their overlap. Additionally, we extend LASP-2 to LASP-2H by applying a similar communication redesign to standard attention modules, offering an efficient SP solution for hybrid models that blend linear and standard attention layers. Our evaluation on a Linear-Llama3 model, a variant of Llama3 with linear attention replacing standard attention, demonstrates the effectiveness of LASP-2 and LASP-2H. Specifically, LASP-2 achieves training speed improvements of 15.2% over LASP and 36.6% over Ring Attention, with a sequence length of 2048K across 64 GPUs. The code is released as a part of: https://github.com/OpenSparseLLMs/Linear-MoE.
中文: 本文提出LASP-2这一新型序列并行方法,通过重构通信计算流程仅需单次AllGather操作,显著提升了长序列线性注意力Transformer训练的通信与计算并行效率。
English: This paper introduces LASP-2, a novel sequence parallelism method that enhances communication and computation parallelism for training linear attention transformers with long sequences by minimizing communication overhead through a redesigned workflow requiring only one AllGather operation.
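
The key property enabling a single AllGather is that each chunk of a linear-attention sequence can be summarized by a fixed-size state. A single-process numpy toy, ignoring normalization, gating, and the distributed machinery:
```python
import numpy as np

def chunk_state(K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Summarize a chunk as a (d x dv) state, independent of chunk length."""
    return K.T @ V

def linear_attn_with_states(Qs, Ks, Vs):
    """Causal linear attention across chunks: each chunk attends to its own
    tokens plus the accumulated states of all previous chunks. In LASP-2 these
    fixed-size states are what a single AllGather would exchange."""
    prefix = np.zeros((Qs[0].shape[1], Vs[0].shape[1]))
    outs = []
    for Q, K, V in zip(Qs, Ks, Vs):
        causal = np.tril(Q @ K.T) @ V     # intra-chunk causal part
        outs.append(Q @ prefix + causal)  # inter-chunk part uses only the state
        prefix += chunk_state(K, V)
    return outs
```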

Authors:Viacheslav Vasilev, Julia Agafonova, Nikolai Gerasimenko, Alexander Kapitanov, Polina Mikhailova, Evelina Mironova, Denis Dimitrov
Title: RusCode: Russian Cultural Code Benchmark for Text-to-Image Generation
Abstract:
Text-to-image generation models have gained popularity among users around the world. However, many of these models exhibit a strong bias toward English-speaking cultures, ignoring or misrepresenting the unique characteristics of other language groups, countries, and nationalities. The lack of cultural awareness can reduce the generation quality and lead to undesirable consequences such as unintentional insult, and the spread of prejudice. In contrast to the field of natural language processing, cultural awareness in computer vision has not been explored as extensively. In this paper, we strive to reduce this gap. We propose a RusCode benchmark for evaluating the quality of text-to-image generation containing elements of the Russian cultural code. To do this, we form a list of 19 categories that best represent the features of Russian visual culture. Our final dataset consists of 1250 text prompts in Russian and their translations into English. The prompts cover a wide range of topics, including complex concepts from art, popular culture, folk traditions, famous people's names, natural objects, scientific achievements, etc. We present the results of a human evaluation of the side-by-side comparison of Russian visual concepts representations using popular generative models.
中文:文本到图像生成模型常偏向英语文化,因此我们提出RusCode基准,通过涵盖俄罗斯文化特征的人类评估提示来提升其文化表现力。
English: Text-to-image models often exhibit cultural bias favoring English-speaking contexts, prompting the development of the RusCode benchmark to evaluate and improve their representation of Russian cultural elements through human-assessed prompts.

Authors:Duong Anh Kiet
Title: Hierarchical Document Parsing via Large Margin Feature Matching and Heuristics
Abstract:
We present our solution to the AAAI-25 VRD-IU challenge, achieving first place in the competition. Our approach integrates large margin loss for improved feature discrimination and employs heuristic rules to refine hierarchical relationships. By combining a deep learning-based matching strategy with greedy algorithms, we achieve a significant boost in accuracy while maintaining computational efficiency. Our method attains an accuracy of 0.98904 on the private leaderboard, demonstrating its effectiveness in document structure parsing. Source codes are publicly available at https://github.com/ffyyytt/VRUID-AAAI-DAKiet
Chinese: 我们的方案在AAAI-25 VRD-IU挑战赛中夺冠,通过结合大间隔损失提升特征区分度和启发式规则优化层级关系,以0.98904的准确率实现了高效计算。
English: Our solution won the AAAI-25 VRD-IU challenge by integrating large margin loss for better feature discrimination and heuristic rules to refine hierarchical relationships, achieving 0.98904 accuracy with efficient computation.

Authors:Zican Dong, Junyi Li, Jinhao Jiang, Mingyu Xu, Wayne Xin Zhao, Bingning Wang, Weipeng Chen
Title: LongReD: Mitigating Short-Text Degradation of Long-Context Large Language Models via Restoration Distillation
Abstract:
Large language models (LLMs) have gained extended context windows through scaling positional encodings and lightweight continual pre-training. However, this often leads to degraded performance on short-text tasks, while the reasons for this degradation remain insufficiently explored. In this work, we identify two primary factors contributing to this issue: distribution drift in hidden states and attention scores, and catastrophic forgetting during continual pre-training. To address these challenges, we propose Long Context Pre-training with Restoration Distillation (LongReD), a novel approach designed to mitigate short-text performance degradation through minimizing the distribution discrepancy between the extended and original models. Besides training on long texts, LongReD distills the hidden state of selected layers from the original model on short texts. Additionally, LongReD also introduces a short-to-long distillation, aligning the output distribution on short texts with that on long texts by leveraging skipped positional indices. Experiments on common text benchmarks demonstrate that LongReD effectively preserves the model's short-text performance while maintaining comparable or even better capacity to handle long texts than baselines. Our code is available at https://github.com/RUCAIBox/LongReD.
中文摘要:LongReD方法通过恢复蒸馏技术解决大语言模型扩展上下文窗口时出现的分布漂移和灾难性遗忘问题,有效缓解了短文本任务性能下降,同时保持长文本处理能力。
English Summary: The LongReD method is introduced to counteract the performance decline of large language models on short-text tasks when their context windows are expanded, by addressing distribution drift and catastrophic forgetting through restoration distillation techniques.
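
A sketch of the two distillation terms in PyTorch; the layer choice, loss weights, and KL direction below are assumptions, not the repository's exact settings.
```python
import torch
import torch.nn.functional as F

def restoration_distill_loss(student_hidden, teacher_hidden,
                             student_logits, teacher_logits,
                             layers=(8, 16, 24), alpha: float = 1.0,
                             beta: float = 1.0) -> torch.Tensor:
    """Restoration distillation sketch: (1) pull selected layers of the
    context-extended model back toward the original model's hidden states on
    short texts; (2) align output distributions (the paper's short-to-long
    variant additionally uses skipped positional indices)."""
    h_loss = sum(F.mse_loss(student_hidden[l], teacher_hidden[l]) for l in layers)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    return alpha * h_loss + beta * kl
```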

Authors:Zilu Dong, Xiangqing Shen, Rui Xia
Title: MEMIT-Merge: Addressing MEMIT's Key-Value Conflicts in Same-Subject Batch Editing for LLMs
Abstract:
As large language models continue to scale up, knowledge editing techniques that modify models' internal knowledge without full retraining have gained significant attention. MEMIT, a prominent batch editing algorithm, stands out for its capability to perform mass knowledge modifications. However, we uncover that MEMIT's editing efficacy significantly deteriorates when processing batches containing multiple edits sharing the same subject. Our analysis reveals this stems from MEMIT's key-value modeling framework: identical keys (derived from the shared subject) are forced to represent different values (corresponding to different knowledge), resulting in update conflicts during editing. To address this issue, we propose MEMIT-Merge, an enhanced approach that merges the value computation processes for facts sharing the same subject, effectively resolving the performance degradation in same-subject batch editing scenarios. Experimental results demonstrate that when MEMIT's edit success rate drops to around 50% at larger batch sizes, MEMIT-Merge maintains a success rate exceeding 90%, showcasing remarkable robustness to subject entity collisions. The code is available at https://github.com/NUSTM/MEMIT-Merge.
中文: MEMIT算法在处理同主体批量知识编辑时,因相同键值对应不同知识导致更新冲突,性能显著下降至约50%成功率;而提出的MEMIT-Merge方法通过合并同主体事实的值计算过程,将成功率稳定保持在90%以上,有效解决了该问题。
English: MEMIT, a batch knowledge editing method for large language models, suffers performance degradation when handling multiple edits with the same subject due to conflicting key-value updates, but the proposed MEMIT-Merge enhancement resolves this by merging value computations, maintaining over 90% success rate versus MEMIT's 50% drop.

Authors:Junlong Li, Daya Guo, Dejian Yang, Runxin Xu, Yu Wu, Junxian He
Title: CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
Abstract:
Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives -- like logic flow planning, state-space searching, decision tree traversal, and modular decomposition -- while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models are available at https://github.com/hkust-nlp/CodeIO.
中文摘要:CodeI/O方法通过将代码转化为自然语言的输入输出预测,使语言模型学习通用推理模式,从而在多种推理任务中实现性能提升。
English Summary: The CodeI/O method enhances reasoning in language models by transforming code into natural language input-output predictions, exposing universal reasoning patterns and improving performance across diverse tasks.
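
Constructing such examples is mechanical once the code is executable. A minimal sketch, with illustrative prompt wording:
```python
# Sketch of turning a grounded function into input/output prediction examples,
# in the spirit of CodeI/O (prompt text is an assumption).

def make_codeio_examples(source: str, fn, test_inputs: list) -> list[dict]:
    examples = []
    for x in test_inputs:
        y = fn(x)  # executing the code yields verifiable ground truth
        examples.append({  # output-prediction direction
            "prompt": f"Given the code:\n{source}\nand input {x!r}, "
                      f"reason step by step and predict the output.",
            "answer": repr(y),
        })
        examples.append({  # input-prediction direction
            "prompt": f"Given the code:\n{source}\nand output {y!r}, "
                      f"reason step by step and predict a valid input.",
            "answer": repr(x),
        })
    return examples

# Predicted inputs can be checked by re-executing fn, enabling the multi-turn
# revision used to build CodeI/O++.
```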

Authors:Yelin Chen, Fanjin Zhang, Jie Tang
Title: Small Language Model Makes an Effective Long Text Extractor
Abstract:
Named Entity Recognition (NER) is a fundamental problem in natural language processing (NLP). However, the task of extracting longer entity spans (e.g., awards) from extended texts (e.g., homepages) is barely explored. Current NER methods predominantly fall into two categories: span-based methods and generation-based methods. Span-based methods require the enumeration of all possible token-pair spans, followed by classification on each span, resulting in substantial redundant computations and excessive GPU memory usage. In contrast, generation-based methods involve prompting or fine-tuning large language models (LLMs) to adapt to downstream NER tasks. However, these methods struggle with the accurate generation of longer spans and often incur significant time costs for effective fine-tuning. To address these challenges, this paper introduces a lightweight span-based NER method called SeNER, which incorporates a bidirectional arrow attention mechanism coupled with LogN-Scaling on the [CLS] token to embed long texts effectively, and comprises a novel bidirectional sliding-window plus-shaped attention (BiSPA) mechanism to reduce redundant candidate token-pair spans significantly and model interactions between token-pair spans simultaneously. Extensive experiments demonstrate that our method achieves state-of-the-art extraction accuracy on three long NER datasets and is capable of extracting entities from long texts in a GPU-memory-friendly manner. Code: https://github.com/THUDM/scholar-profiling/tree/main/sener
中文: 本文提出轻量级跨度命名实体识别方法SeNER,通过创新的注意力机制有效处理长文本中的实体跨度,在实现最先进抽取精度的同时保持GPU内存友好性。
English: This paper introduces SeNER, a lightweight span-based NER method that effectively handles long entity spans in extended texts through innovative attention mechanisms, achieving state-of-the-art accuracy while being GPU-memory-efficient.

Authors:Wei Wu, Qiuyi Li, Mingyang Li, Kun Fu, Fuli Feng, Jieping Ye, Hui Xiong, Zheng Wang
Title: GENERator: A Long-Context Generative Genomic Foundation Model
Abstract:
Advancements in DNA sequencing technologies have significantly improved our ability to decode genomic sequences. However, the prediction and interpretation of these sequences remain challenging due to the intricate nature of genetic material. Large language models (LLMs) have introduced new opportunities for biological sequence analysis. Recent developments in genomic language models have underscored the potential of LLMs in deciphering DNA sequences. Nonetheless, existing models often face limitations in robustness and application scope, primarily due to constraints in model structure and training data scale. To address these limitations, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of eukaryotic DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences that translate into proteins structurally analogous to known families. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles. These capabilities position the GENERator as a pivotal tool for genomic research and biotechnological advancement, enhancing our ability to interpret and predict complex biological systems and enabling precise genomic interventions. Implementation details and supplementary resources are available at https://github.com/GenerTeam/GENERator.
中文: GENERator是一个拥有12亿参数和98千碱基对上下文长度的生成式基因组基础模型,基于3860亿碱基对真核DNA训练,在生成蛋白质编码序列和优化具有特定活性增强子序列方面表现卓越,为基因组研究和生物技术提供了关键工具。
English: The GENERator is a generative genomic foundation model with 1.2B parameters and a 98k bp context length, trained on 386B bp of eukaryotic DNA, achieving state-of-the-art performance in generating protein-coding sequences and optimizing enhancer sequences for genomic research and biotechnology.

Authors:Xuefeng Liu, Songhao Jiang, Siyu Chen, Zhuoran Yang, Yuxin Chen, Ian Foster, Rick Stevens
Title: DrugImproverGPT: A Large Language Model for Drug Optimization with Fine-Tuning via Structured Policy Optimization
Abstract:
Fine-tuning a Large Language Model (LLM) is crucial for generating results towards specific objectives. This research delves into the realm of drug optimization and introduces a novel reinforcement learning algorithm to fine-tune a drug optimization LLM-based generative model, enhancing the original drug across target objectives while retaining its beneficial chemical properties. This work comprises two primary components: (1) DrugImprover: a framework tailored for improving robustness and efficiency in drug optimization. It includes an LLM designed for drug optimization and a novel Structured Policy Optimization (SPO) algorithm, which is theoretically grounded. This algorithm offers a unique perspective on fine-tuning the LLM-based generative model by aligning the improvement of the generated molecule with the input molecule under desired objectives. (2) A dataset of 1 million compounds, each with OEDOCK docking scores on 5 human proteins associated with cancer cells and 24 binding sites from the SARS-CoV-2 virus. We conduct a comprehensive evaluation of SPO and demonstrate its effectiveness in improving the original drug across target properties. Our code and dataset will be publicly available at: https://github.com/xuefeng-cs/DrugImproverGPT.
中文摘要:本研究提出了一种新颖的强化学习算法——结构化策略优化(SPO),用于微调药物优化大语言模型,在提升目标特性的同时保持原始药物的有益化学性质。
English Summary: This research introduces a reinforcement learning algorithm called Structured Policy Optimization (SPO) to fine-tune a drug optimization LLM, improving target properties while preserving beneficial chemical characteristics of original drugs.

Authors:Elias Lumer, Pradeep Honaganahalli Basavaraju, Myles Mason, James A. Burke, Vamse Kumar Subbiah
Title: Graph RAG-Tool Fusion
Abstract:
Recent developments in retrieval-augmented generation (RAG) for selecting relevant tools from a tool knowledge base enable LLM agents to scale their complex tool calling capabilities to hundreds or thousands of external tools, APIs, or agents-as-tools. However, traditional RAG-based tool retrieval fails to capture structured dependencies between tools, limiting the retrieval accuracy of a retrieved tool's dependencies. For example, among a vector database of tools, a "get stock price" API requires a "stock ticker" parameter from a "get stock ticker" API, and both depend on OS-level internet connectivity tools. In this paper, we address this limitation by introducing Graph RAG-Tool Fusion, a novel plug-and-play approach that combines the strengths of vector-based retrieval with efficient graph traversal to capture all relevant tools (nodes) along with any nested dependencies (edges) within the predefined tool knowledge graph. We also present ToolLinkOS, a new tool selection benchmark of 573 fictional tools, spanning over 15 industries, each with an average of 6.3 tool dependencies. We demonstrate that Graph RAG-Tool Fusion achieves absolute improvements of 71.7% and 22.1% over naïve RAG on ToolLinkOS and ToolSandbox benchmarks, respectively (mAP@10). ToolLinkOS dataset is available at https://github.com/EliasLumer/Graph-RAG-Tool-Fusion-ToolLinkOS
中文: 本文提出Graph RAG-Tool Fusion方法,通过结合向量检索与图遍历技术来捕捉工具间依赖关系,在新型基准测试上相比传统RAG实现了显著性能提升。
English: This paper introduces Graph RAG-Tool Fusion, a plug-and-play method that enhances tool retrieval by combining vector-based search with graph traversal to capture tool dependencies, achieving significant improvements over traditional RAG on new benchmarks.
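
The two-stage retrieval can be sketched as vector search followed by a dependency-closure traversal; the data structures below are assumptions, and the paper's scoring and traversal are richer.
```python
import numpy as np
from collections import deque

def retrieve_with_deps(query_vec: np.ndarray, tool_vecs: dict[str, np.ndarray],
                       deps: dict[str, list[str]], k: int = 3) -> list[str]:
    """Sketch of 'vector retrieval, then dependency closure'."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # 1) Vector search: top-k tools by cosine similarity to the query.
    seeds = sorted(tool_vecs, key=lambda t: cos(query_vec, tool_vecs[t]),
                   reverse=True)[:k]
    # 2) Graph traversal: BFS pulls in all nested dependencies (edges).
    selected, queue = set(seeds), deque(seeds)
    while queue:
        for dep in deps.get(queue.popleft(), []):
            if dep not in selected:
                selected.add(dep)
                queue.append(dep)
    return sorted(selected)

# e.g. "get stock price" would drag in "get stock ticker" and the OS-level
# connectivity tools it transitively depends on.
```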

Authors:Girish A. Koushik, Diptesh Kanojia, Helen Treharne
Title: Towards a Robust Framework for Multimodal Hate Detection: A Study on Video vs. Image-based Content
Abstract:
Social media platforms enable the propagation of hateful content across different modalities such as textual, auditory, and visual, necessitating effective detection methods. While recent approaches have shown promise in handling individual modalities, their effectiveness across different modality combinations remains unexplored. This paper presents a systematic analysis of fusion-based approaches for multimodal hate detection, focusing on their performance across video and image-based content. Our comprehensive evaluation reveals significant modality-specific limitations: while simple embedding fusion achieves state-of-the-art performance on video content (HateMM dataset) with a 9.9 percentage-point F1-score improvement, it struggles with complex image-text relationships in memes (Hateful Memes dataset). Through detailed ablation studies and error analysis, we demonstrate how current fusion approaches fail to capture nuanced cross-modal interactions, particularly in cases involving benign confounders. Our findings provide crucial insights for developing more robust hate detection systems and highlight the need for modality-specific architectural considerations. The code is available at https://github.com/gak97/Video-vs-Meme-Hate.
Chinese: 本研究系统评估了基于融合的多模态仇恨内容检测方法,发现简单嵌入融合在视频内容上表现优异,但在处理表情包中复杂的图文关系时存在不足,因其难以捕捉细微的跨模态交互特征。
English: This study systematically evaluates fusion-based approaches for multimodal hate detection, revealing that while simple embedding fusion excels with video content, it struggles with complex image-text relationships in memes due to limitations in capturing nuanced cross-modal interactions.
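
The "simple embedding fusion" baseline the study analyzes can be pictured as concatenating frozen unimodal embeddings and classifying with a small head; the dimensions below are illustrative, not the paper's configuration.
```python
import torch
import torch.nn as nn

class EmbeddingFusion(nn.Module):
    """Simple embedding fusion in miniature: concatenate text and visual
    embeddings, then classify with a small MLP."""

    def __init__(self, d_text: int = 768, d_vis: int = 512, n_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_text + d_vis, 256), nn.ReLU(), nn.Linear(256, n_classes)
        )

    def forward(self, text_emb: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(torch.cat([text_emb, vis_emb], dim=-1))
```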

Authors:Danrui Li, Sen Zhang, Sam S. Sohn, Kaidong Hu, Muhammad Usman, Mubbasir Kapadia
Title: Cardiverse: Harnessing LLMs for Novel Card Game Prototyping
Abstract:
The prototyping of computer games, particularly card games, requires extensive human effort in creative ideation and gameplay evaluation. Recent advances in Large Language Models (LLMs) offer opportunities to automate and streamline these processes. However, it remains challenging for LLMs to design novel game mechanics beyond existing databases, generate consistent gameplay environments, and develop scalable gameplay AI for large-scale evaluations. This paper addresses these challenges by introducing a comprehensive automated card game prototyping framework. The approach highlights a graph-based indexing method for generating novel game variations, an LLM-driven system for consistent game code generation validated by gameplay records, and a gameplay AI constructing method that uses an ensemble of LLM-generated heuristic functions optimized through self-play. These contributions aim to accelerate card game prototyping, reduce human labor, and lower barriers to entry for game developers. The code repository is available at https://github.com/danruili/Cardiverse.
中文: 本文提出了一种自动化卡牌游戏原型框架,通过基于图的索引和大型语言模型驱动的系统,生成新颖游戏机制、确保一致的游戏体验并开发可扩展的AI,从而减少人力投入并降低开发门槛。
English: This paper introduces an automated card game prototyping framework that uses graph-based indexing and LLM-driven systems to generate novel game mechanics, ensure consistent gameplay, and develop scalable AI, thereby reducing human effort and barriers for developers.

Authors:Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David Mortensen
Title: Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment
Abstract:
Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
Chinese: MixGoP是一种利用高斯混合模型对音素分布进行多子簇建模的新方法,在多数数据集上实现了最优性能,并证明自监督语音模型特征比传统方法能更有效地捕捉音位变体差异。
English: MixGoP is a novel approach that uses Gaussian mixture models to model phoneme distributions with multiple subclusters, achieving state-of-the-art performance on most datasets and demonstrating that self-supervised speech model features capture allophonic variation more effectively than traditional methods.
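
A sketch of GMM-based phoneme scoring in this spirit, assuming frozen S3M frame features grouped by phoneme; the paper's exact training and scoring details differ.
```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_phoneme_gmms(feats_by_phoneme: dict[str, np.ndarray],
                     n_components: int = 4) -> dict[str, GaussianMixture]:
    """Fit one mixture per phoneme on frames from typical speech, so each
    subcluster can capture a distinct allophonic realization."""
    return {p: GaussianMixture(n_components, covariance_type="diag",
                               random_state=0).fit(X)
            for p, X in feats_by_phoneme.items()}

def goodness_of_pronunciation(gmms, phoneme: str, frames: np.ndarray) -> float:
    """Mean log-likelihood of test frames under the target phoneme's GMM;
    low values flag atypical realizations."""
    return float(gmms[phoneme].score(frames))
```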

Authors:Haoqi Wang, Tong Zhang, Mathieu Salzmann
Title: Demystifying Singular Defects in Large Language Models
Abstract:
Large transformer models are known to produce high-norm tokens. In vision transformers (ViTs), such tokens have been mathematically modeled through the singular vectors of the linear approximations of layers. However, in large language models (LLMs), the underlying causes of high-norm tokens remain largely unexplored, and their different properties from those of ViTs require a new analysis framework. In this paper, we provide both theoretical insights and empirical validation across a range of recent models, leading to the following observations: i) The layer-wise singular direction predicts the abrupt explosion of token norms in LLMs. ii) The negative eigenvalues of a layer explain its sudden decay. iii) The computational pathways leading to high-norm tokens differ between initial and noninitial tokens. iv) High-norm tokens are triggered by the right leading singular vector of the matrix approximating the corresponding modules. We showcase two practical applications of these findings: the improvement of quantization schemes and the design of LLM signatures. Our findings not only advance the understanding of singular defects in LLMs but also open new avenues for their application. We expect that this work will stimulate further research into the internal mechanisms of LLMs. Code is released at https://github.com/haoqiwang/singular_defect.
中文: 本研究从理论和实证角度揭示了大型语言模型中高范数令牌的成因机制,阐明了其与视觉Transformer的差异,并通过量化方案改进和模型签名设计展示了实际应用价值。
English: This study provides theoretical and empirical insights into the causes of high-norm tokens in large language models, revealing their distinct mechanisms from vision transformers and demonstrating practical applications in quantization improvement and model signature design.
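
Observation (iv) is easy to reproduce in miniature: the right leading singular vector of a layer's linear approximation is the input direction with the largest gain. A toy numpy check on a random matrix standing in for a layer:
```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))      # stand-in for a layer's linear approximation
U, S, Vt = np.linalg.svd(W)
v1 = Vt[0]                         # right leading singular vector
x = rng.normal(size=64)
x /= np.linalg.norm(x)

print(np.linalg.norm(W @ v1))      # equals S[0], the largest attainable gain
print(np.linalg.norm(W @ x))       # a typical direction is amplified far less
```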

Authors:Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, Heng Ji
Title: SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering
Abstract:
Software engineering (SE) is increasingly collaborative, with developers working together on shared complex codebases. Effective collaboration in shared environments requires participants -- whether humans or AI agents -- to stay on the same page as their environment evolves. When a collaborator's understanding diverges from the current state -- what we term the out-of-sync challenge -- the collaborator's actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents' capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agent <= 3.33% to Claude-3.5-Sonnet >= 28.18%), their consistently low collaboration willingness (<= 4.86%) suggests fundamental limitations of existing LLMs in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents' resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems. Code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.
中文: 本文提出SyncMind框架来解决协作软件工程中的"不同步"挑战,即LLM智能体对代码库的理解与实际状态发生偏离,并通过SyncBench基准测试揭示现有智能体在协作意愿和资源意识方面存在根本性局限,尽管协作发生时其性能与恢复成功率呈正相关。
English: This paper introduces SyncMind, a framework that defines the out-of-sync challenge in collaborative software engineering where LLM agents' understanding diverges from the actual codebase state, along with SyncBench, a benchmark revealing existing agents' low collaboration willingness and weak resource awareness, while showing that collaboration, when it does occur, correlates positively with recovery success.

Authors:Finnian Westenfelder, Erik Hemberg, Miguel Tulla, Stephen Moskal, Una-May O'Reilly, Silviu Chiricescu
Title: LLM-Supported Natural Language to Bash Translation
Abstract:
The Bourne-Again Shell (Bash) command-line interface for Linux systems has complex syntax and requires extensive specialized knowledge. Using the natural language to Bash command (NL2SH) translation capabilities of large language models (LLMs) for command composition circumvents these issues. However, the NL2SH performance of LLMs is difficult to assess due to inaccurate test data and unreliable heuristics for determining the functional equivalence of Bash commands. We present a manually verified test dataset of 600 instruction-command pairs and a training dataset of 40,939 pairs, increasing the size of previous datasets by 441% and 135%, respectively. Further, we present a novel functional equivalence heuristic that combines command execution with LLM evaluation of command outputs. Our heuristic can determine the functional equivalence of two Bash commands with 95% confidence, a 16% increase over previous heuristics. Evaluation of popular LLMs using our test dataset and heuristic demonstrates that parsing, in-context learning, in-weight learning, and constrained decoding can improve NL2SH accuracy by up to 32%. Our findings emphasize the importance of dataset quality, execution-based evaluation and translation method for advancing NL2SH translation. Our code is available at https://github.com/westenfelder/NL2SH
English Summary: Large language models can translate natural language to Bash commands, but their performance is hard to evaluate due to poor test data and unreliable equivalence checks; this study introduces verified datasets and a new evaluation method that improves assessment confidence by 16% and translation accuracy by up to 32%.
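
The paper's equivalence heuristic pairs command execution with an LLM check of outputs. Below is a minimal sketch of that idea under stated assumptions: commands run via bash in a throwaway directory, exact output matches are accepted directly, and `llm_judge` is a hypothetical hook for the LLM comparison, not the authors' API.

```python
import subprocess, tempfile

def run_bash(cmd, timeout=10):
    """Run a command with bash in a scratch directory; return (code, stdout)."""
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(["bash", "-c", cmd], cwd=workdir,
                              capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout

def llm_judge(cmd_a, out_a, cmd_b, out_b):
    """Placeholder: ask an LLM whether the two outputs reflect the same
    behavior (e.g., mere ordering or formatting differences)."""
    raise NotImplementedError

def functionally_equivalent(cmd_a, cmd_b):
    code_a, out_a = run_bash(cmd_a)
    code_b, out_b = run_bash(cmd_b)
    if code_a != code_b:
        return False
    if out_a == out_b:        # exact output match: accept without the LLM
        return True
    return llm_judge(cmd_a, out_a, cmd_b, out_b)

print(functionally_equivalent("echo hi", "printf 'hi\\n'"))
```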

Authors:Jinyu Xiang, Jiayi Zhang, Zhaoyang Yu, Xinbing Liang, Fengwei Teng, Jinhao Tu, Fashen Ren, Xiangru Tang, Sirui Hong, Chenglin Wu, Yuyu Luo
Title: Self-Supervised Prompt Optimization
Abstract:
Well-designed prompts are crucial for enhancing Large language models' (LLMs) reasoning capabilities while aligning their outputs with task requirements across diverse domains. However, manually designed prompts require expertise and iterative experimentation. While existing prompt optimization methods aim to automate this process, they rely heavily on external references such as ground truth or human evaluation, limiting their applicability in real-world scenarios where such data is unavailable or costly to obtain. To address this, we propose Self-Supervised Prompt Optimization (SPO), a cost-efficient framework that discovers effective prompts for both closed and open-ended tasks without requiring external references. Motivated by the observations that prompt quality manifests directly in LLM outputs and LLMs can effectively assess adherence to task requirements, we derive evaluation and optimization signals purely from output comparisons. Specifically, SPO selects superior prompts through pairwise output comparisons evaluated by an LLM evaluator, followed by an LLM optimizer that aligns outputs with task requirements. Extensive experiments demonstrate that SPO outperforms state-of-the-art prompt optimization methods, achieving comparable or superior results with significantly lower costs (e.g., 1.1% to 5.6% of existing methods) and fewer samples (e.g., three samples). The code is available at https://github.com/FoundationAgents/SPO.
Chinese Summary: 本文提出自监督提示优化框架(SPO),通过利用纯输出比较而无需外部参考,自主发现适用于封闭式和开放式任务的有效提示,以极低成本实现了优于现有方法的性能。
English Summary: The paper introduces Self-Supervised Prompt Optimization (SPO), a cost-effective framework that autonomously discovers effective prompts for both closed and open-ended tasks by leveraging pairwise output comparisons without requiring external references, achieving superior performance at significantly reduced costs.
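
As a rough illustration of the SPO loop, here is a hedged sketch: `generate`, `evaluate_pair`, and `optimize_prompt` stand in for the executor, evaluator, and optimizer LLM calls and are placeholders rather than the released interface.

```python
def spo_round(prompt_best, task_samples, generate, evaluate_pair, optimize_prompt):
    """One self-supervised optimization round; no ground truth is consulted."""
    # 1) Optimizer proposes a candidate prompt from the current best one.
    prompt_new = optimize_prompt(prompt_best)
    wins = 0
    for sample in task_samples:      # per the paper, a few samples can suffice
        out_best = generate(prompt_best, sample)
        out_new = generate(prompt_new, sample)
        # 2) Evaluator compares the two outputs pairwise.
        if evaluate_pair(sample, out_new, out_best) == "new":
            wins += 1
    # 3) Keep whichever prompt wins the pairwise comparison.
    return prompt_new if wins > len(task_samples) / 2 else prompt_best
```

Keeping the evaluation purely pairwise is what removes the need for ground truth: only relative judgments between two outputs are ever requested.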

Authors:Siyeol Jung, Taehwan Kim
Title: DiffListener: Discrete Diffusion Model for Listener Generation
Abstract:
The listener head generation (LHG) task aims to generate natural nonverbal listener responses based on the speaker's multimodal cues. Prior work either relies on limited modalities (e.g., audio and facial information) or employs autoregressive approaches, which suffer from limitations such as accumulating prediction errors. To address these limitations, we propose DiffListener, a discrete diffusion based approach for non-autoregressive listener head generation. Our model takes the speaker's facial information, audio, and text as inputs, additionally incorporating facial differential information to represent the temporal dynamics of expressions and movements. With this explicit modeling of facial dynamics, DiffListener can generate coherent reaction sequences in a non-autoregressive manner. Through comprehensive experiments, DiffListener demonstrates state-of-the-art performance in both quantitative and qualitative evaluations. The user study shows that DiffListener generates natural context-aware listener reactions that are well synchronized with the speaker. The code and demo videos are available at https://siyeoljung.github.io/DiffListener

Authors:Chengqi Lyu, Songyang Gao, Yuzhe Gu, Wenwei Zhang, Jianfei Gao, Kuikun Liu, Ziyi Wang, Shuaibin Li, Qian Zhao, Haian Huang, Weihan Cao, Jiangning Liu, Hongwei Liu, Junnan Liu, Songyang Zhang, Dahua Lin, Kai Chen
Title: Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Abstract:
Reasoning abilities, especially those for solving complex math problems, are crucial components of general intelligence. Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks. However, the complete technical details remain unrevealed, and the techniques widely believed to be adopted are only reinforcement learning (RL) and long chains of thought. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks, where only binary outcome rewards are easily accessible. We theoretically prove that behavior cloning on positive trajectories from best-of-N (BoN) sampling is sufficient to learn the KL-regularized optimal policy in binary feedback environments. This formulation further implies that the rewards of negative samples should be reshaped to ensure the gradient consistency between positive and negative samples. To alleviate the long-existing difficulties brought by sparse rewards in RL, which are even exacerbated by the partial correctness of the long chain of thought for reasoning tasks, we further apply a token-level reward model to sample important tokens in reasoning trajectories for learning. With OREAL, for the first time, a 7B model can obtain 94.0 pass@1 accuracy on MATH-500 through RL, being on par with 32B models. OREAL-32B also surpasses previous 32B models trained by distillation with 95.0 pass@1 accuracy on MATH-500. Our investigation also indicates the importance of initial policy models and training queries for RL. Code, models, and data will be released to benefit future research (https://github.com/InternLM/OREAL).
中文:本文提出了OREAL这一新型强化学习框架,利用二元结果奖励显著提升语言模型的数学推理能力,使小规模模型首次达到与大型模型相媲美的准确率。
English: This paper introduces OREAL, a novel reinforcement learning framework that uses binary outcome rewards to significantly enhance mathematical reasoning in language models, achieving state-of-the-art accuracy with smaller model sizes.
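
As a loose sketch of the training signal OREAL builds on, the snippet below samples N solutions, behavior-clones the verified-correct ones, and downweights negatives with an opposite-signed term as a crude stand-in for the paper's reward reshaping; `sample_fn` and `verify_fn` are hypothetical hooks, and the token-level reward model is omitted.

```python
import torch
import torch.nn.functional as F

def oreal_step(model, prompt_ids, sample_fn, verify_fn, n=16, neg_weight=0.5):
    """One best-of-N round: clone positives, reshape negatives (simplified)."""
    trajectories = [sample_fn(model, prompt_ids) for _ in range(n)]
    loss = torch.tensor(0.0)
    for traj in trajectories:
        logits = model(traj["input_ids"]).logits[:, :-1]   # HF-style stand-in
        targets = traj["input_ids"][:, 1:]
        nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                              targets.reshape(-1))
        if verify_fn(traj):              # binary outcome reward = 1: clone it
            loss = loss + nll
        else:                            # reshaped negative term, downweighted
            loss = loss - neg_weight * nll
    return loss / n
```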

Authors:Ling Yang, Zhaochen Yu, Bin Cui, Mengdi Wang
Title: ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
Abstract:
We show that hierarchical LLM reasoning via scaling thought templates can effectively optimize the reasoning search space and outperform the mathematical reasoning capabilities of powerful LLMs like OpenAI o1-preview and DeepSeek V3. We train our ReasonFlux-32B model with only 8 GPUs and introduce three innovations: (i) a structured and generic thought template library, containing around 500 high-level thought templates capable of generalizing to similar or relevant reasoning problems; (ii) performing hierarchical reinforcement learning on a sequence of thought templates instead of long CoTs, optimizing a base LLM to plan out an optimal template trajectory for gradually handling complex problems; (iii) a brand new inference scaling system that enables hierarchical LLM reasoning by adaptively scaling thought templates at inference time. With a template trajectory containing more explainable reasoning structures than DeepSeek-R1 and o3-mini, our ReasonFlux-32B significantly advances math reasoning capabilities to state-of-the-art levels. Notably, on the MATH benchmark, it achieves an accuracy of 91.2% and surpasses o1-preview by 6.7%. On the USA Math Olympiad (AIME) benchmark, ReasonFlux-32B solves an average of 56.7% of problems, surpassing o1-preview and DeepSeek-V3 by 27% and 45%, respectively. Code: https://github.com/Gen-Verse/ReasonFlux
中文: ReasonFlux-32B模型通过分层推理和可扩展思维模板,在MATH基准测试中达到91.2%准确率,在AIME中解题率达56.7%,显著超越了OpenAI o1-preview和DeepSeek V3等先进模型。
English: The ReasonFlux-32B model introduces hierarchical reasoning with scalable thought templates, achieving state-of-the-art math performance by surpassing leading models like OpenAI o1-preview and DeepSeek V3 on benchmarks including MATH (91.2% accuracy) and AIME (56.7% problem-solving rate).

Authors:Qingshui Gu, Shu Li, Tianyu Zheng, Zhaoxiang Zhang
Title: Steel-LLM:From Scratch to Open Source -- A Personal Journey in Building a Chinese-Centric LLM
Abstract:
Steel-LLM is a Chinese-centric language model developed from scratch with the goal of creating a high-quality, open-source model despite limited computational resources. Launched in March 2024, the project aimed to train a 1-billion-parameter model on a large-scale dataset, prioritizing transparency and the sharing of practical insights to assist others in the community. The training process primarily focused on Chinese data, with a small proportion of English data included, addressing gaps in existing open-source LLMs by providing a more detailed and practical account of the model-building journey. Steel-LLM has demonstrated competitive performance on benchmarks such as CEVAL and CMMLU, outperforming early models from larger institutions. This paper provides a comprehensive summary of the project's key contributions, including data collection, model design, training methodologies, and the challenges encountered along the way, offering a valuable resource for researchers and practitioners looking to develop their own LLMs. The model checkpoints and training script are available at https://github.com/zhanshijinwat/Steel-LLM.
中文: Steel-LLM是2024年3月发布的中文优先开源语言模型,基于十亿参数规模并以中文数据为核心进行训练,在多项基准测试中表现优异,同时完整公开了模型开发过程与实践经验。
English: Steel-LLM is a Chinese-centric open-source language model developed from scratch in March 2024, featuring 1 billion parameters trained primarily on Chinese data with competitive benchmark performance and full transparency in its development process.

Authors:Zhi Zhou, Kun-Yang Yu, Shi-Yu Tian, Xiao-Wen Yang, Jiang-Xin Shi, Pengxiao Song, Yi-Xuan Jin, Lan-Zhe Guo, Yu-Feng Li
Title: LawGPT: Knowledge-Guided Data Generation and Its Application to Legal LLM
Abstract:
Large language models (LLMs), both proprietary and open-source, have demonstrated remarkable capabilities across various natural language processing tasks. However, they face significant limitations in legal reasoning tasks. Proprietary models introduce data privacy risks and high inference costs, while open-source models underperform due to insufficient legal domain training data. To address these limitations, we study data generation for legal reasoning to improve the legal reasoning performance of open-source LLMs with the help of proprietary LLMs. This is challenging due to the lack of legal knowledge in proprietary LLMs and the difficulty in verifying the generated data. We propose KgDG, a knowledge-guided data generation framework for legal reasoning. Our framework enables leveraging legal knowledge to enhance generation diversity and introduces a refinement and verification process to ensure the quality of generated data. Moreover, we expand the generated dataset to further enhance the LLM reasoning capabilities. Using KgDG, we create a synthetic legal reasoning dataset containing 50K high-quality examples. Our trained model LawGPT outperforms existing legal-specific LLMs and achieves performance comparable to proprietary LLMs, demonstrating the effectiveness of KgDG and LawGPT. Our code and resources are publicly available at https://github.com/LAMDASZ-ML/Knowledge-Guide-Data-Generation.
中文摘要:本研究提出知识引导数据生成框架KgDG,通过生成高质量法律推理数据集提升开源大语言模型的性能,其训练的LawGPT模型在保持数据隐私和成本优势的同时,达到了与商业模型相当的法律推理能力。
English Summary: This study introduces KgDG, a knowledge-guided data generation framework that creates high-quality legal reasoning datasets to enhance open-source LLMs' performance, with the resulting LawGPT model matching proprietary LLMs' capabilities while addressing privacy and cost concerns.

Authors:Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, Conghui He
Title: Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation
Abstract:
First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset's challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. Code available at: https://github.com/opendatalab/ProverGen
Chinese: ProverGen框架创新性地结合了大语言模型与符号证明器,构建出包含逻辑连贯中间推理步骤的ProverQA数据集,即使采用思维链提示,当前最先进的LLM仍难以解决其问题,凸显了该数据集的挑战性。
English: ProverGen is a novel framework that combines Large Language Models with symbolic provers to create ProverQA, a challenging FOL reasoning dataset with coherent intermediate steps, which state-of-the-art LLMs struggle to solve even with chain-of-thought prompting.

Authors:Yibo Wang, Congying Xia, Wenting Zhao, Jiangshu Du, Chunyu Miao, Zhongfen Deng, Philip S. Yu, Chen Xing
Title: ProjectTest: A Project-level LLM Unit Test Generation Benchmark and Impact of Error Fixing Mechanisms
Abstract:
Unit test generation has become a promising and important use case of LLMs. However, existing evaluation benchmarks for assessing LLM unit test generation capabilities focus on function- or class-level code rather than more practical and challenging project-level codebases. To address such limitation, we propose ProjectTest, a project-level benchmark for unit test generation covering Python, Java, and JavaScript. ProjectTest features 20 moderate-sized and high-quality projects per language. We evaluate nine frontier LLMs on ProjectTest and the results show that all frontier LLMs tested exhibit moderate performance on ProjectTest on Python and Java, highlighting the difficulty of ProjectTest. We also conduct a thorough error analysis, which shows that even frontier LLMs, such as Claude-3.5-Sonnet, have significant basic yet critical errors, including compilation and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms. Our code and dataset are available at https://github.com/YiboWANG214/ProjectTest.
中文: ProjectTest提出了一个针对Python、Java和JavaScript的项目级单元测试生成基准,发现即使先进的大语言模型也表现中等且存在关键错误,同时通过纠错机制探索了其改进潜力。
English: ProjectTest introduces a project-level benchmark for unit test generation across Python, Java, and JavaScript, revealing that even advanced LLMs struggle with moderate performance and critical errors, while also exploring their potential through error-fixing mechanisms.

Authors:Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang
Title: Systematic Outliers in Large Language Models
Abstract:
Outliers have been widely observed in Large Language Models (LLMs), significantly impacting model performance and posing challenges for model compression. Understanding the functionality and formation mechanisms of these outliers is critically important. Existing works, however, largely focus on reducing the impact of outliers from an algorithmic perspective, lacking an in-depth investigation into their causes and roles. In this work, we provide a detailed analysis of the formation process, underlying causes, and functions of outliers in LLMs. We define and categorize three types of outliers-activation outliers, weight outliers, and attention outliers-and analyze their distributions across different dimensions, uncovering inherent connections between their occurrences and their ultimate influence on the attention mechanism. Based on these observations, we hypothesize and explore the mechanisms by which these outliers arise and function, demonstrating through theoretical derivations and experiments that they emerge due to the self-attention mechanism's softmax operation. These outliers act as implicit context-aware scaling factors within the attention mechanism. As these outliers stem from systematic influences, we term them systematic outliers. Our study not only enhances the understanding of Transformer-based LLMs but also shows that structurally eliminating outliers can accelerate convergence and improve model compression. The code is available at https://github.com/an-yongqi/systematic-outliers.
中文摘要:本研究分析了大型语言模型中异常值的形成机制、成因及功能,揭示其源于自注意力机制的softmax操作并作为隐式缩放因子发挥作用,实验表明结构性地消除这些异常值可加速模型收敛并提升压缩效果。
English Summary: This study analyzes the formation, causes, and functions of outliers in Large Language Models, revealing they emerge from the softmax operation in self-attention and serve as implicit scaling factors, with their structural elimination shown to accelerate convergence and improve model compression.

Authors:Haiduo Huang, Fuwei Yang, Zhenhua Liu, Yixing Xu, Jinze Li, Yang Liu, Xuanwu Yin, Dong Li, Pengju Ren, Emad Barsoum
Title: Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE
Abstract:
Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to predict multiple tokens, which are then verified in parallel by the larger target model. However, the limited capacity of the draft model often necessitates tree-based sampling to improve prediction accuracy, where multiple candidates are generated at each step. We identify a key limitation in this approach: the candidates at the same step are derived from the same representation, limiting diversity and reducing overall effectiveness. To address this, we propose Jakiro, leveraging Mixture of Experts (MoE), where independent experts generate diverse predictions, effectively decoupling correlations among candidates. Furthermore, we introduce a hybrid inference strategy, combining autoregressive decoding for initial tokens with parallel decoding for subsequent stages, and enhance the latter with a contrastive mechanism over features to improve accuracy. Our method significantly boosts prediction accuracy and achieves higher inference speedups. Extensive experiments across diverse models validate the effectiveness and robustness of our approach, establishing a new SOTA in speculative decoding. Our codes are available at https://github.com/haiduo/Jakiro.
中文: Jakiro通过采用专家混合模型生成多样化的令牌预测并结合混合推理策略,显著提高了推测解码的准确性和推理速度,在不同模型上均验证了其有效性。
English: Jakiro enhances speculative decoding by employing Mixture of Experts to generate diverse token predictions and a hybrid inference strategy, significantly improving both accuracy and inference speed across various models.

Authors:Zhichen Dong, Zhanhui Zhou, Zhixuan Liu, Chao Yang, Chaochao Lu
Title: Emergent Response Planning in LLMs
Abstract:
In this work, we argue that large language models (LLMs), though trained to predict only the next token, exhibit emergent planning behaviors: their hidden representations encode future outputs beyond the next token. Through simple probing, we demonstrate that LLM prompt representations encode global attributes of their entire responses, including structure attributes (e.g., response length, reasoning steps), content attributes (e.g., character choices in storywriting, multiple-choice answers at the end of response), and behavior attributes (e.g., answer confidence, factual consistency). In addition to identifying response planning, we explore how it scales with model size across tasks and how it evolves during generation. The findings that LLMs plan ahead for the future in their hidden representations suggest potential applications for improving transparency and generation control.
中文: 本研究发现大型语言模型通过隐藏表征编码未来输出的结构、内容和行为等全局属性,展现出涌现的规划能力,这种能力随模型规模扩展并在生成过程中演变,为提升透明度和控制力提供了可能。
English: This study reveals that large language models exhibit emergent planning capabilities by encoding future response attributes—such as structure, content, and behavior—in their hidden representations, which scales with model size and evolves during generation, offering potential for enhanced transparency and control.
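
The probing setup is simple enough to sketch. Assuming a Hugging Face-style causal LM, one can take the hidden state of the final prompt token and fit a linear probe against an observed response attribute such as length; above-chance held-out performance is the paper's style of evidence for planning. The features below are random stand-ins.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def prompt_representation(model, tokenizer, prompt):
    """Hidden state of the last prompt token (HF-style causal LM assumed)."""
    import torch
    ids = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[-1][0, -1].numpy()   # last layer, last token

# Suppose X holds prompt representations and y the observed response lengths.
X = np.random.randn(200, 768)                 # stand-in features for 200 prompts
y = np.random.randint(10, 300, size=200)      # stand-in response lengths
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("probe R^2:", probe.score(X_te, y_te))  # above-chance R^2 suggests planning
```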

Authors:Yu Wang, Nan Yang, Liang Wang, Furu Wei, Fuli Feng
Title: Examining False Positives under Inference Scaling for Mathematical Reasoning
Abstract:
Recent advancements in language models have led to significant improvements in mathematical reasoning across various benchmarks. However, most of these benchmarks rely on automatic evaluation methods that only compare final answers using heuristics, without verifying the underlying reasoning steps. This limitation results in false positive solutions, where models may produce correct final answers but with flawed deduction paths. In this paper, we systematically examine the prevalence of false positive solutions in mathematical problem solving for language models. We analyze the characteristics and extent of this issue across different open-source models, datasets of varying difficulty levels, and decoding strategies. Specifically, we explore how false positives influence the inference time scaling behavior of language models. Our experimental results reveal that: (1) false positive solutions persist across different models, datasets, and decoding methods, (2) sampling-based inference time scaling methods do not alleviate the problem, and (3) the pass@N evaluation metric is more susceptible to false positives, suggesting a significantly lower scaling ceiling than what automatic evaluations indicate. Additionally, we analyze specific instances of false positives and discuss potential limitations in self-improvement techniques and synthetic data generation under such conditions. Our data and code are publicly available at https://github.com/Wloner0809/False-Positives-in-Math.
Chinese: 当前语言模型在数学推理中普遍存在虚假正解现象,即答案正确但推理过程存在缺陷,这一问题在不同模型和数据集上持续存在,并削弱了pass@N等自动评估指标的可靠性。
English: Current language models often produce false positive solutions in mathematical reasoning where correct answers mask flawed deduction processes, which persist across various models and datasets and undermine the reliability of automatic evaluation metrics like pass@N.
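
A small computation makes the pass@N concern concrete: if acceptance only checks the final answer, solutions with flawed reasoning still count, so pass@N under automatic evaluation upper-bounds the step-verified rate. The toy data below is illustrative only.

```python
def pass_at_n(samples_per_problem, accept):
    """samples_per_problem: list of lists of solutions; accept: predicate."""
    solved = sum(any(accept(s) for s in samples)
                 for samples in samples_per_problem)
    return solved / len(samples_per_problem)

# Each solution records final-answer correctness and whether the
# reasoning steps actually hold.
problems = [
    [{"answer_ok": True, "steps_ok": False}, {"answer_ok": False, "steps_ok": False}],
    [{"answer_ok": True, "steps_ok": True}],
    [{"answer_ok": False, "steps_ok": False}],
]
print(pass_at_n(problems, lambda s: s["answer_ok"]))                    # ~0.67
print(pass_at_n(problems, lambda s: s["answer_ok"] and s["steps_ok"]))  # ~0.33
```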

Authors:Naome A. Etori, Maria L. Gini
Title: RideKE: Leveraging Low-Resource, User-Generated Twitter Content for Sentiment and Emotion Detection in Kenyan Code-Switched Dataset
Abstract:
Social media has become a crucial open-access platform for individuals to express opinions and share experiences. However, leveraging low-resource language data from Twitter is challenging due to scarce, poor-quality content and the major variations in language use, such as slang and code-switching. Identifying tweets in these languages can be difficult as Twitter primarily supports high-resource languages. We analyze Kenyan code-switched data and evaluate four state-of-the-art (SOTA) transformer-based pretrained models for sentiment and emotion classification, using supervised and semi-supervised methods. We detail the methodology behind data collection and annotation, and the challenges encountered during the data curation phase. Our results show that XLM-R outperforms other models; for sentiment analysis, the XLM-R supervised model achieves the highest accuracy (69.2%) and F1 score (66.1%), followed by XLM-R semi-supervised (67.2% accuracy, 64.1% F1 score). In emotion analysis, DistilBERT supervised leads in accuracy (59.8%) and F1 score (31%), followed by mBERT semi-supervised (59% accuracy, 26.5% F1 score). AfriBERTa models show the lowest accuracy and F1 scores. All models tend to predict neutral sentiment, with AfriBERTa showing the highest bias and a unique sensitivity to the empathy emotion. Code: https://github.com/NEtori21/Ride_hailing
中文: 本研究评估了四种基于Transformer的模型对肯尼亚语码转换推特数据进行情感与情绪分类的效果,发现XLM-R在情感分析中表现最优而DistilBERT在情绪分析中领先,所有模型均呈现中性预测倾向并具有独特的情感偏差特征。
English: This study evaluates four transformer-based models for sentiment and emotion classification on Kenyan code-switched Twitter data, finding that XLM-R performs best for sentiment analysis while DistilBERT leads in emotion analysis, with all models showing a tendency toward neutral predictions and unique emotional biases.

Authors:Sumin An, Junyoung Sung, Wonpyo Park, Chanjun Park, Paul Hongsuck Seo
Title: LCIRC: A Recurrent Compression Approach for Efficient Long-form Context and Query Dependent Modeling in LLMs
Abstract:
While large language models (LLMs) excel in generating coherent and contextually rich outputs, their capacity to efficiently handle long-form contexts is limited by fixed-length position embeddings. Additionally, the computational cost of processing long sequences increases quadratically, making it challenging to extend context length. To address these challenges, we propose Long-form Context Injection with Recurrent Compression (LCIRC), a method that enables efficient processing of long-form sequences beyond the model's length limit through recurrent compression without retraining the entire model. We further introduce query dependent context modeling, which selectively compresses query-relevant information, ensuring that the model retains the most pertinent content. Our empirical results demonstrate that Query Dependent LCIRC (QD-LCIRC) significantly improves LLM's ability to manage extended contexts, making it well-suited for tasks that require both comprehensive context understanding and query relevance.

Authors:Ce Zhang, Zifu Wan, Zhehan Kan, Martin Q. Ma, Simon Stepputtis, Deva Ramanan, Russ Salakhutdinov, Louis-Philippe Morency, Katia Sycara, Yaqi Xie
Title: Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models
Abstract:
While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
Chinese: 近期大型视觉语言模型常产生与视觉输入不符的幻觉文本,而新型无训练算法DeGF通过文本到图像生成模型在解码过程中提供自反馈,有效减少幻觉生成,在多个基准测试中超越现有最优方法。
English: Recent Large Vision-Language Models (LVLMs) often produce hallucinatory text responses, but a new training-free algorithm called Decoding with Generative Feedback (DeGF) leverages text-to-image generative models to provide self-feedback during decoding, effectively reducing hallucinations and outperforming state-of-the-art methods across multiple benchmarks.

Authors:Jian Xu, Sichun Luo, Xiangyu Chen, Haoming Huang, Hanxu Hou, Linqi Song
Title: RALLRec: Improving Retrieval Augmented Large Language Model Recommendation with Representation Learning
Abstract:
Large Language Models (LLMs) have been integrated into recommendation systems to enhance user behavior comprehension. The Retrieval Augmented Generation (RAG) technique is further incorporated into these systems to retrieve more relevant items and improve system performance. However, existing RAG methods rely primarily on textual semantics and often fail to incorporate the most relevant items, limiting the effectiveness of the systems. In this paper, we propose Representation learning for retrieval-Augmented Large Language model Recommendation (RALLRec). Specifically, we enhance textual semantics by prompting LLMs to generate more detailed item descriptions, followed by joint representation learning of textual and collaborative semantics, which are extracted by the LLM and recommendation models, respectively. Considering the potential time-varying characteristics of user interest, a simple yet effective reranking method is further introduced to capture the dynamics of user preference. We conducted extensive experiments on three real-world datasets, and the evaluation results validated the effectiveness of our method. Code is made public at https://github.com/JianXu95/RALLRec.
中文摘要:本文提出RALLRec方法,通过结合大语言模型生成的详细项目描述与协同过滤,并引入重排序技术来捕捉用户兴趣的动态变化,实验证明该方法能有效提升推荐系统性能。
English Summary: The paper introduces RALLRec, a method that enhances recommendation systems by combining detailed LLM-generated item descriptions with collaborative filtering and a reranking technique to adapt to dynamic user preferences, showing effectiveness in experiments.

Authors:Saptarshi Ghosh, Tianyu Jiang
Title: ConMeC: A Dataset for Metonymy Resolution with Common Nouns
Abstract:
Metonymy plays an important role in our daily communication. People naturally think about things using their most salient properties or commonly related concepts. For example, by saying "The bus decided to skip our stop today," we actually mean that the bus driver made the decision, not the bus. Prior work on metonymy resolution has mainly focused on named entities. However, metonymy involving common nouns (such as desk, baby, and school) is also a frequent and challenging phenomenon. We argue that NLP systems should be capable of identifying the metonymic use of common nouns in context. We create a new metonymy dataset ConMeC, which consists of 6,000 sentences, where each sentence is paired with a target common noun and annotated by humans to indicate whether that common noun is used metonymically or not in that context. We also introduce a chain-of-thought based prompting method for detecting metonymy using large language models (LLMs). We evaluate our LLM-based pipeline, as well as a supervised BERT model on our dataset and three other metonymy datasets. Our experimental results demonstrate that LLMs could achieve performance comparable to the supervised BERT model on well-defined metonymy categories, while still struggling with instances requiring nuanced semantic understanding. Our dataset is publicly available at: https://github.com/SaptGhosh/ConMeC.
Chinese: 本研究推出了用于检测普通名词转喻的新数据集ConMeC,并证明大型语言模型在识别转喻用法上可与监督式BERT模型相媲美,但在处理语义细微的实例时仍存在困难。
English: This study introduces ConMeC, a new dataset for detecting metonymy in common nouns, and demonstrates that large language models can perform comparably to supervised BERT models in identifying metonymic usage, though challenges remain with nuanced cases.

Authors:Seokwon Song, Taehyun Lee, Jaewoo Ahn, Jae Hyuk Sung, Gunhee Kim
Title: Is a Peeled Apple Still Red? Evaluating LLMs' Ability for Conceptual Combination with Property Type
Abstract:
Conceptual combination is a cognitive process that merges basic concepts, enabling the creation of complex expressions. During this process, the properties of combination (e.g., the whiteness of a peeled apple) can be inherited from basic concepts, newly emerge, or be canceled. However, previous studies have evaluated a limited set of properties and have not examined the generative process. To address this gap, we introduce the Conceptual Combination with Property Type dataset (CCPT), which consists of 12.3K annotated triplets of noun phrases, properties, and property types. Using CCPT, we establish three types of tasks to evaluate LLMs for conceptual combination thoroughly. Our key findings are threefold: (1) Our automatic metric grading property emergence and cancellation closely corresponds with human judgments. (2) LLMs, including OpenAI's o1, struggle to generate noun phrases which possess given emergent properties. (3) Our proposed method, inspired by cognitive psychology model that explains how relationships between concepts are formed, improves performances in all generative tasks. The dataset and experimental code are available at https://github.com/seokwon99/CCPT.git.
中文: 本研究引入CCPT数据集评估大语言模型的概念组合能力,发现模型在生成具有涌现属性的名词短语时存在困难,但通过认知心理学启发的改进方法能有效提升表现。
English: The study introduces the CCPT dataset to evaluate how large language models handle conceptual combination, finding they struggle with emergent properties but can be improved using cognitive psychology-inspired methods.

Authors:Jiabin Tang, Tianyu Fan, Chao Huang
Title: AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents
Abstract:
Large Language Model (LLM) Agents have demonstrated remarkable capabilities in task automation and intelligent decision-making, driving the widespread adoption of agent development frameworks such as LangChain and AutoGen. However, these frameworks predominantly serve developers with extensive technical expertise - a significant limitation considering that only 0.03 % of the global population possesses the necessary programming skills. This stark accessibility gap raises a fundamental question: Can we enable everyone, regardless of technical background, to build their own LLM agents using natural language alone? To address this challenge, we introduce AutoAgent-a Fully-Automated and highly Self-Developing framework that enables users to create and deploy LLM agents through Natural Language Alone. Operating as an autonomous Agent Operating System, AutoAgent comprises four key components: i) Agentic System Utilities, ii) LLM-powered Actionable Engine, iii) Self-Managing File System, and iv) Self-Play Agent Customization module. This lightweight yet powerful system enables efficient and dynamic creation and modification of tools, agents, and workflows without coding requirements or manual intervention. Beyond its code-free agent development capabilities, AutoAgent also serves as a versatile multi-agent system for General AI Assistants. Comprehensive evaluations on the GAIA benchmark demonstrate AutoAgent's effectiveness in generalist multi-agent tasks, surpassing existing state-of-the-art methods. Furthermore, AutoAgent's Retrieval-Augmented Generation (RAG)-related capabilities have shown consistently superior performance compared to many alternative LLM-based solutions.
中文:AutoAgent是一个全自动框架,允许用户仅通过自然语言创建和部署LLM智能体,突破了现有框架的技术壁垒,并在多智能体任务和检索增强生成能力上展现出卓越性能。
English: AutoAgent is a fully automated framework that enables users to create and deploy LLM agents using natural language alone, overcoming the technical barriers of existing frameworks and demonstrating superior performance in multi-agent tasks and RAG capabilities.

Authors:Paul Darm, Annalisa Riccardi
Title: Head-Specific Intervention Can Induce Misaligned AI Coordination in Large Language Models
Abstract:
Robust alignment guardrails for large language models (LLMs) are becoming increasingly important with their widespread application. In contrast to previous studies, we demonstrate that inference-time activation interventions can bypass safety alignments and effectively steer model generations towards harmful AI coordination. Our method applies fine-grained interventions at specific attention heads, which we identify by probing each head in a simple binary choice task. We then show that interventions on these heads generalise to the open-ended generation setting, effectively circumventing safety guardrails. We demonstrate that intervening on a few attention heads is more effective than intervening on full layers or supervised fine-tuning. We further show that only a few example completions are needed to compute effective steering directions, which is an advantage over classical fine-tuning. We also demonstrate that applying interventions in the negative direction can prevent a common jailbreak attack. Our results suggest that, at the attention head level, activations encode fine-grained linearly separable behaviours. Practically, the approach offers a straightforward methodology to steer large language model behaviour, which could be extended to diverse domains beyond safety, requiring fine-grained control over the model output. The code and datasets for this study can be found on https://github.com/PaulDrm/targeted_intervention.
中文摘要:本研究揭示通过对特定注意力头进行针对性干预,能有效突破大语言模型的安全防护机制并引导其生成有害内容,该方法比传统微调更具精准控制优势。
English Summary: This study reveals that targeted interventions on specific attention heads can effectively bypass LLM safety alignments and steer model behavior toward harmful outputs, offering a fine-grained control method that surpasses traditional fine-tuning approaches.
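
Mechanically, the intervention amounts to adding a steering direction to one head's slice of the attention output at inference time. The sketch below uses a forward pre-hook on the output projection, where head slices are still separated; module paths and dimensions follow common Hugging Face layouts and are assumptions, not the paper's code.

```python
import torch

def make_head_pre_hook(head_idx, head_dim, direction, alpha=4.0):
    """Pre-hook for the attention output projection (o_proj): its input
    still has per-head activations concatenated along the last dim."""
    direction = direction / direction.norm()
    def pre_hook(module, args):
        hidden = args[0]                 # (batch, seq, n_heads * head_dim)
        sl = slice(head_idx * head_dim, (head_idx + 1) * head_dim)
        hidden[..., sl] = hidden[..., sl] + alpha * direction   # steer one head
        return (hidden,) + args[1:]
    return pre_hook

# Usage sketch (layer and head indices would come from binary-choice probing):
# layer = model.model.layers[14].self_attn.o_proj
# handle = layer.register_forward_pre_hook(
#     make_head_pre_hook(head_idx=7, head_dim=128, direction=torch.randn(128)))
# ... model.generate(...) ...
# handle.remove()    # a negative alpha can instead suppress a behavior
```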

Authors:Hongye Liu, Ricardo Henao
Title: Learning to Substitute Words with Model-based Score Ranking
Abstract:
Smart word substitution aims to enhance sentence quality by improving word choices; however, current benchmarks rely on human-labeled data. Since word choices are inherently subjective, ground-truth word substitutions generated by a small group of annotators are often incomplete and likely not generalizable. To circumvent this issue, we instead employ a model-based score (BARTScore) to quantify sentence quality, thus forgoing the need for human annotations. Specifically, we use this score to define a distribution for each word substitution, allowing one to test whether a substitution is statistically superior relative to others. In addition, we propose a loss function that directly optimizes the alignment between model predictions and sentence scores, while also enhancing the overall quality score of a substitution. Crucially, model learning no longer requires human labels, thus avoiding the cost of annotation while maintaining the quality of the text modified with substitutions. Experimental results show that the proposed approach outperforms both masked language models (BERT, BART) and large language models (GPT-4, LLaMA). The source code is available at https://github.com/Hyfred/Substitute-Words-with-Ranking.
中文摘要:本研究采用基于模型的评分(BARTScore)量化句子质量,无需人工标注即可评估词语替换效果,并通过优化模型预测与句子评分的匹配度,在实验中超越了现有语言模型的性能。
English Summary: This study uses a model-based score (BARTScore) to evaluate word substitutions without human annotations, optimizing the alignment between model predictions and sentence quality scores and outperforming existing masked and large language models.
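
The scoring side is straightforward to illustrate: rank candidate substitutions by a BARTScore-style conditional log-likelihood of the edited sentence given the original, with no human labels involved. The model choice and scoring direction below are assumptions for the sketch, not the paper's exact setup.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn").eval()

def bart_score(src, tgt):
    """Mean token log-likelihood of tgt given src (higher is better)."""
    src_ids = tok(src, return_tensors="pt").input_ids
    tgt_ids = tok(tgt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = bart(input_ids=src_ids, labels=tgt_ids)
    return -out.loss.item()   # loss is mean cross-entropy over target tokens

sentence = "The results were very good."
candidates = ["The results were excellent.", "The results were nice."]
ranked = sorted(candidates, key=lambda c: bart_score(sentence, c), reverse=True)
print(ranked)
```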

Authors:Runchuan Zhu, Zinco Jiang, Jiang Wu, Zhipeng Ma, Jiahe Song, Fengshuo Bai, Dahua Lin, Lijun Wu, Conghui He
Title: GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation
Abstract:
Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge, thereby reducing hallucinations and improving reliability. Effective RAIT must address two key challenges: first, effectively rejecting unknown questions to minimize hallucinations; second, avoiding over-refusal to ensure that answerable questions are not rejected, thereby maintaining the helpfulness of LLM outputs. In this paper, we address the two challenges by deriving insightful observations from the gradient-based perspective, and proposing the Gradient-driven Refusal Aware Instruction Tuning Framework GRAIT, which (1) employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal, achieving the balance between accurate refusals and maintaining useful responses. Experimental evaluations on open-ended and multiple-choice question answering tasks demonstrate that GRAIT significantly outperforms existing RAIT methods in the overall performance. The source code and data will be available at https://github.com/opendatalab/GRAIT.
Chinese: GRAIT框架通过梯度驱动的样本选择减少幻觉,并采用自适应权重机制防止过度拒绝,从而在问答任务中优于现有方法,提升了大型语言模型的性能。
English: GRAIT is a framework that enhances large language models by using gradient-driven sample selection to reduce hallucinations and an adaptive weighting mechanism to prevent over-refusal, achieving better performance in question-answering tasks than existing methods.

Authors:Jen-tse Huang, Yuhang Yan, Linqi Liu, Yixin Wan, Wenxuan Wang, Kai-Wei Chang, Michael R. Lyu
Title: Where Fact Ends and Fairness Begins: Redefining AI Bias Evaluation through Cognitive Biases
Abstract:
Recent failures such as Google Gemini generating people of color in Nazi-era uniforms illustrate how AI outputs can be factually plausible yet socially harmful. AI models are increasingly evaluated for "fairness," yet existing benchmarks often conflate two fundamentally different dimensions: factual correctness and normative fairness. A model may generate responses that are factually accurate but socially unfair, or conversely, appear fair while distorting factual reality. We argue that identifying the boundary between fact and fair is essential for meaningful fairness evaluation. We introduce Fact-or-Fair, a benchmark with (i) objective queries aligned with descriptive, fact-based judgments, and (ii) subjective queries aligned with normative, fairness-based judgments. Our queries are constructed from 19 statistics and are grounded in cognitive psychology, drawing on representativeness bias, attribution bias, and ingroup-outgroup bias to explain why models often misalign fact and fairness. Experiments across ten frontier models reveal different levels of fact-fair trade-offs. By reframing fairness evaluation, we provide both a new theoretical lens and a practical benchmark to advance the responsible model assessments. Our test suite is publicly available at https://github.com/uclanlp/Fact-or-Fair.
中文摘要:本文强调区分AI评估中事实准确性与规范性公平的重要性,并介绍了Fact-or-Fair基准测试,通过基于认知偏差构建的客观和主观查询来检验模型在这两个维度的表现。
English Summary: This paper argues for distinguishing factual accuracy from normative fairness in AI evaluation and introduces the Fact-or-Fair benchmark, which tests models on objective and subjective queries grounded in cognitive biases.

Authors:Zherui Li, Houcheng Jiang, Hao Chen, Baolong Bi, Zhenhong Zhou, Fei Sun, Junfeng Fang, Xiang Wang
Title: Reinforced Lifelong Editing for Language Models
Abstract:
Large language models (LLMs) acquire information from pre-training corpora, but their stored knowledge can become inaccurate or outdated over time. Model editing addresses this challenge by modifying model parameters without retraining, and prevalent approaches leverage hypernetworks to generate these parameter updates. However, they face significant challenges in lifelong editing due to their incompatibility with LLM parameters that dynamically change during the editing process. To address this, we observed that hypernetwork-based lifelong editing aligns with reinforcement learning modeling and proposed RLEdit, an RL-based editing method. By treating editing losses as rewards and optimizing hypernetwork parameters at the full knowledge sequence level, we enable it to precisely capture LLM changes and generate appropriate parameter updates. Our extensive empirical evaluation across several LLMs demonstrates that RLEdit outperforms existing methods in lifelong editing with superior effectiveness and efficiency, achieving a 59.24% improvement while requiring only 2.11% of the time compared to most approaches. Our code is available at: https://github.com/zhrli324/RLEdit.
中文: 大语言模型的知识会随时间过时,RLEdit通过强化学习方法优化超网络参数,能精准捕捉模型变化,在持续编辑中实现59.24%的效果提升且仅需2.11%的时间。
English: Large language models face challenges with outdated knowledge, and RLEdit, a reinforcement learning-based editing method, significantly improves lifelong editing by optimizing hypernetwork parameters to capture model changes efficiently.

Authors:Md. Ashraful Islam, Mohammed Eunus Ali, Md Rizwan Parvez
Title: CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging
Abstract:
Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other tool-based runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSim, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. As humans verify their understanding of an algorithm through visual simulation, CodeSim uniquely features a method of plan verification and internal debugging through the step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSim's remarkable code generation capabilities. Our framework achieves new state-of-the-art (pass@1) results (HumanEval 95.1%, MBPP 90.7%, APPS 22%, and CodeContests 29.1%). Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers. To facilitate further research and development in this area, we have open-sourced our framework at https://kagnlp.github.io/codesim.github.io/.

Authors:Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu
Title: Large Multimodal Models for Low-Resource Languages: A Survey
Abstract:
In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion strategies. Through a comprehensive analysis of 106 studies across 75 LR languages, we identify key patterns in how researchers tackle the challenges of limited data and computational resources. We find that visual information often serves as a crucial bridge for improving model performance in LR settings, though significant challenges remain in areas such as hallucination mitigation and computational efficiency. We aim to provide researchers with a clear understanding of current approaches and remaining challenges in making LMMs more accessible to speakers of LR (understudied) languages. We complement our survey with an open-source repository available at: https://github.com/marianlupascu/LMM4LRL-Survey.
中文摘要:本综述系统分析了将大型多模态模型适配低资源语言的技术,发现视觉增强是提升性能的关键桥梁,但幻觉缓解和计算效率仍是主要挑战。
English Summary: This survey systematically examines techniques for adapting large multimodal models to low-resource languages, identifying visual enhancement as a key strategy while highlighting persistent challenges in hallucination mitigation and computational efficiency.

Authors:Zhiqiang Liu, Chengtao Gan, Junjie Wang, Yichi Zhang, Zhongpu Bo, Mengshu Sun, Huajun Chen, Wen Zhang
Title: OntoTune: Ontology-Driven Self-training for Aligning Large Language Models
Abstract:
Existing domain-specific Large Language Models (LLMs) are typically developed by fine-tuning general-purpose LLMs with large-scale domain-specific corpora. However, training on large-scale corpora often fails to effectively organize domain knowledge of LLMs, leading to fragmented understanding. Inspired by how humans connect concepts and organize knowledge through mind maps, we aim to emulate this approach by using ontology with hierarchical conceptual knowledge to reorganize LLM's domain knowledge. From this perspective, we propose an ontology-driven self-training framework called OntoTune, which aims to align LLMs with ontology through in-context learning, enabling the generation of responses guided by the ontology. We leverage in-context learning to identify whether the LLM has acquired the specific concept's ontology knowledge, and select the entries not yet mastered by LLM as the training set to further align the LLM with ontology. Compared to existing domain LLMs based on newly collected large-scale domain-specific corpora, our OntoTune, which relies on the existing, long-term developed ontology and LLM itself, significantly reduces data maintenance costs and offers improved generalization ability. We conduct our study in the medical domain to evaluate the effectiveness of OntoTune, utilizing a standardized medical ontology, SNOMED CT, as our ontology source. Experimental results demonstrate that OntoTune achieves state-of-the-art performance in both in-ontology task hypernym discovery and out-of-ontology task medical domain QA. Moreover, compared to the latest direct ontology injection method TaxoLLaMA, our OntoTune better preserves original knowledge of LLM. The code and data are available at https://github.com/zjukg/OntoTune.
中文摘要:OntoTune框架通过本体驱动的自训练方法,将大语言模型与层次化知识结构对齐,在提升领域任务表现的同时显著降低数据维护成本并保留模型原有知识。
English Summary: The OntoTune framework enhances domain-specific LLMs by using ontology-driven self-training to align models with hierarchical knowledge structures, improving performance while reducing data costs and preserving original capabilities.

Authors:Chongyu Fan, Jinghan Jia, Yihua Zhang, Anil Ramakrishna, Mingyi Hong, Sijia Liu
Title: Towards LLM Unlearning Resilient to Relearning Attacks: A Sharpness-Aware Minimization Perspective and Beyond
Abstract:
The LLM unlearning technique has recently been introduced to comply with data regulations and address the safety and ethical concerns of LLMs by removing the undesired data-model influence. However, state-of-the-art unlearning methods face a critical vulnerability: they are susceptible to "relearning" the removed information from a small number of forget data points, known as relearning attacks. In this paper, we systematically investigate how to make unlearned models robust against such attacks. For the first time, we establish a connection between robust unlearning and sharpness-aware minimization (SAM) through a unified robust optimization framework, in an analogy to adversarial training designed to defend against adversarial attacks. Our analysis for SAM reveals that smoothness optimization plays a pivotal role in mitigating relearning attacks. Thus, we further explore diverse smoothing strategies to enhance unlearning robustness. Extensive experiments on benchmark datasets, including WMDP and MUSE, demonstrate that SAM and other smoothness optimization approaches consistently improve the resistance of LLM unlearning to relearning attacks. Notably, smoothness-enhanced unlearning also helps defend against (input-level) jailbreaking attacks, broadening our proposal's impact in robustifying LLM unlearning. Codes are available at https://github.com/OPTML-Group/Unlearn-Smooth.
中文: 本研究提出了一种鲁棒遗忘框架,通过锐度感知最小化和平滑性优化来防御大语言模型中的再学习和越狱攻击,并在基准数据集上进行了实验验证。
English: The study introduces a robust unlearning framework for LLMs that leverages sharpness-aware minimization and smoothness optimization to defend against relearning and jailbreaking attacks, with experimental validation on benchmark datasets.
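
The SAM ingredient the paper leans on is standard and easy to sketch: perturb weights toward the local ascent direction, compute gradients at the perturbed point, then update the restored weights. This is generic SAM, not the paper's full unlearning pipeline.

```python
import torch

def sam_step(model, loss_fn, optimizer, rho=0.05):
    """One sharpness-aware minimization step; loss_fn(model) returns a scalar loss."""
    loss = loss_fn(model)
    loss.backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                  # ascend to the sharpest nearby point
            eps.append((p, e))
    optimizer.zero_grad()
    loss_fn(model).backward()          # gradient at the perturbed weights
    with torch.no_grad():
        for p, e in eps:
            p.sub_(e)                  # restore the original weights
    optimizer.step()                   # descend using the SAM gradient
    optimizer.zero_grad()
    return loss.item()
```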

Authors:Weihua Du, Yiming Yang, Sean Welleck
Title: Optimizing Temperature for Language Models with Multi-Sample Inference
Abstract:
Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.
中文摘要:本文提出一种无需标注验证数据的自动化方法,通过新颖的基于熵的指标和随机过程模型优化大语言模型中的温度选择,在多样本聚合策略下显著提升模型性能与可解释性。
English Summary: This paper introduces an automated method for optimizing temperature selection in large language models using multi-sample aggregation, eliminating the need for labeled validation data through a novel entropy-based metric and stochastic process model for improved performance and interpretability.
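The paper's entropy-based metric is not reproduced here; the sketch below only illustrates the general idea of scoring candidate temperatures by the entropy of the answer distribution over repeated samples. sample_answers, the temperature grid, and the target entropy are all placeholder assumptions.

import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (nats) of the empirical distribution over sampled answers."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def pick_temperature(sample_answers, temps=(0.2, 0.4, 0.6, 0.8, 1.0, 1.2), target=0.7):
    """sample_answers(T) -> list of N final answers sampled at temperature T.

    Choose the temperature whose sample entropy is closest to `target`;
    both the grid and the target are illustrative, not the paper's values.
    """
    scored = {t: answer_entropy(sample_answers(t)) for t in temps}
    return min(scored, key=lambda t: abs(scored[t] - target))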

Authors:Gonzalo Gonzalez-Pumariega, Leong Su Yean, Neha Sunkara, Sanjiban Choudhury
Title: Robotouille: An Asynchronous Planning Benchmark for LLM Agents
Abstract:
Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents' ability to handle long-horizon asynchronous scenarios. Our synchronous and asynchronous datasets capture increasingly complex planning challenges that go beyond existing benchmarks, requiring agents to manage overlapping tasks and interruptions. Our results show that ReAct (gpt-4o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution. Code is available at https://github.com/portal-cornell/robotouille.
中文: 本文介绍了Robotouille基准测试,旨在评估大语言模型代理处理复杂长程任务的异步规划能力,揭示了其性能差距及改进推理与反馈机制的必要性。
English: The paper introduces Robotouille, a benchmark designed to evaluate LLM agents' asynchronous planning capabilities for complex, long-horizon tasks, revealing significant performance gaps and the need for improved reasoning and feedback integration.

Authors:Xingjun Ma, Yifeng Gao, Yixu Wang, Ruofan Wang, Xin Wang, Ye Sun, Yifan Ding, Hengyuan Xu, Yunhao Chen, Yunhan Zhao, Hanxun Huang, Yige Li, Yutao Wu, Jiaming Zhang, Xiang Zheng, Yang Bai, Zuxuan Wu, Xipeng Qiu, Jingfeng Zhang, Yiming Li, Xudong Han, Haonan Li, Jun Sun, Cong Wang, Jindong Gu, Baoyuan Wu, Siheng Chen, Tianwei Zhang, Yang Liu, Mingming Gong, Tongliang Liu, Shirui Pan, Cihang Xie, Tianyu Pang, Yinpeng Dong, Ruoxi Jia, Yang Zhang, Shiqing Ma, Xiangyu Zhang, Neil Gong, Chaowei Xiao, Sarah Erfani, Tim Baldwin, Bo Li, Masashi Sugiyama, Dacheng Tao, James Bailey, Yu-Gang Jiang
Title: Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
Abstract:
The rapid advancement of large models, driven by their exceptional abilities in learning and generalization through large-scale pre-training, has reshaped the landscape of Artificial Intelligence (AI). These models are now foundational to a wide range of applications, including conversational AI, recommendation systems, autonomous driving, content generation, medical diagnostics, and scientific discovery. However, their widespread deployment also exposes them to significant safety risks, raising concerns about robustness, reliability, and ethical implications. This survey provides a systematic review of current safety research on large models, covering Vision Foundation Models (VFMs), Large Language Models (LLMs), Vision-Language Pre-training (VLP) models, Vision-Language Models (VLMs), Diffusion Models (DMs), and large-model-powered Agents. Our contributions are summarized as follows: (1) We present a comprehensive taxonomy of safety threats to these models, including adversarial attacks, data poisoning, backdoor attacks, jailbreak and prompt injection attacks, energy-latency attacks, data and model extraction attacks, and emerging agent-specific threats. (2) We review defense strategies proposed for each type of attack if available and summarize the commonly used datasets and benchmarks for safety research. (3) Building on this, we identify and discuss the open challenges in large model safety, emphasizing the need for comprehensive safety evaluations, scalable and effective defense mechanisms, and sustainable data practices. More importantly, we highlight the necessity of collective efforts from the research community and international collaboration. Our work can serve as a useful reference for researchers and practitioners, fostering the ongoing development of comprehensive defense systems and platforms to safeguard AI models.
Chinese: 大型模型通过大规模预训练重塑了人工智能格局,但其广泛应用也带来了严重的安全隐患,本综述系统梳理了相关威胁与防御策略,为构建安全AI体系提供重要参考。
English: Large models are revolutionizing AI across various fields but face significant safety risks, prompting a systematic review of threats and defenses to guide future secure development.

Authors:Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein
Title: Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Abstract:
We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
中文摘要:本研究提出了一种新颖的语言模型架构,通过隐式潜在空间推理扩展测试时计算,无需专门训练数据,在增加计算负载的情况下显著提升了推理基准测试性能。
English Summary: This study introduces a language model architecture that scales test-time computation by iterating a recurrent block in latent space, requiring no specialized training data and improving reasoning-benchmark performance, sometimes dramatically, as the test-time compute budget grows.
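A toy illustration of the recurrent-depth idea: one weight-tied block unrolled to an arbitrary test-time depth, so compute scales with latent iterations rather than generated tokens. The block and sizes below are stand-ins, not the paper's architecture.

import torch
import torch.nn as nn

class RecurrentDepthCore(nn.Module):
    """One shared block iterated `depth` times over a latent state."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, h, depth=8):
        # h: (batch, seq, d_model) latent state from an input embedding stack.
        for _ in range(depth):          # unroll the same weights `depth` times
            h = self.block(h)
        return h

h = torch.randn(2, 16, 256)
core = RecurrentDepthCore()
cheap = core(h, depth=2)                # same weights,
expensive = core(h, depth=32)           # more test-time compute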

Authors:Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Title: NoLiMa: Long-Context Evaluation Beyond Literal Matching
Abstract:
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
中文: NoLiMa基准测试通过减少问题与关键信息间的词汇重叠,解决了现有长文本评估的局限性,结果显示13个主流大语言模型在上下文长度增加时性能显著下降,尽管它们声称支持超过12.8万词元的处理能力。
English: The NoLiMa benchmark addresses limitations in existing long-context evaluation by minimizing lexical overlap between questions and relevant information, revealing significant performance degradation in 13 major LLMs as context length increases despite their claimed 128K+ token capacity.
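The benchmark's defining constraint is minimal lexical overlap between question and needle. The sketch below shows the kind of overlap screen such a design implies; the stopword list and example pair are illustrative, not the authors' curation code.

import re

STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "and", "to"}

def content_words(text):
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def lexical_overlap(question, needle):
    """Fraction of the question's content words that literally appear in the needle."""
    q, n = content_words(question), content_words(needle)
    return len(q & n) / max(len(q), 1)

# Keep only pairs the model cannot solve by literal string matching:
# here the link (Kiasma is in Helsinki) is latent, not lexical.
pair = ("Which character has been to Helsinki?",
        "Actually, Yuki lives next to the Kiasma museum.")
assert lexical_overlap(*pair) == 0.0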

Authors:Yihe Deng, Yu Yang, Junkai Zhang, Wei Wang, Bo Li
Title: DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
Abstract:
The rapid advancement of large language models (LLMs) has increased the need for guardrail models to ensure responsible use, particularly in detecting unsafe and illegal content. While substantial safety data exist in English, multilingual guardrail modeling remains underexplored due to the scarcity of open-source safety data in other languages. To address this gap, we propose a novel two-player Reinforcement Learning (RL) framework, where a generator and a guardrail model co-evolve adversarially to produce high-quality synthetic data for multilingual guardrail training. We theoretically formalize this interaction as a two-player game, proving convergence to a Nash equilibrium. Empirical evaluations show that our model DuoGuard outperforms state-of-the-art models, achieving nearly 10% improvement over LlamaGuard3 (8B) on English benchmarks while being 4.5x faster at inference with a significantly smaller model (0.5B). We achieve substantial advancements in multilingual safety tasks, particularly in addressing the imbalance for lower-resource languages in a collected real dataset. Ablation studies emphasize the critical role of synthetic data generation in bridging the imbalance in open-source data between English and other languages. These findings establish a scalable and efficient approach to synthetic data generation, paving the way for improved multilingual guardrail models to enhance LLM safety. Code, model, and data will be open-sourced at https://github.com/yihedeng9/DuoGuard.
中文: 本研究提出了一种新颖的双玩家强化学习框架,通过生成高质量合成数据来增强多语言护栏模型,在性能和效率上均优于现有方法。
English: This study introduces a novel two-player reinforcement learning framework that generates high-quality synthetic data to enhance multilingual guardrail models, achieving superior performance and efficiency over existing methods.

Authors:Jiayang Yu, Yihang Zhang, Bin Wang, Peiqin Lin, Yongkang Liu, Shi Feng
Title: SSMLoRA: Enhancing Low-Rank Adaptation with State Space Model
Abstract:
Fine-tuning is a key approach for adapting language models to specific downstream tasks, but updating all model parameters becomes impractical as model sizes increase. Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), address this challenge by introducing additional adaptation parameters into pre-trained weight matrices. However, LoRA's performance varies across different insertion points within the model, highlighting potential parameter inefficiency due to unnecessary insertions. To this end, we propose SSMLoRA (State Space Model Low-Rank Adaptation), an extension of LoRA that incorporates a State Space Model (SSM) to interconnect low-rank matrices. SSMLoRA ensures that performance is maintained even with sparser insertions. SSMLoRA allows the model to not only map inputs to a low-rank space for better feature extraction but also leverage the computations from the previous low-rank space. Our method achieves comparable performance to LoRA on the General Language Understanding Evaluation (GLUE) benchmark while using only half the parameters. Additionally, due to its structure, SSMLoRA shows promise in handling tasks with longer input sequences. You can find our code here: https://github.com/yuhkalhic/SSMLoRA.
Chinese: SSMLoRA通过引入状态空间模型连接低秩矩阵,在GLUE基准测试中以仅一半参数实现与LoRA相当的性能,并展现出处理长输入序列的潜力。
English: SSMLoRA enhances LoRA by integrating a State Space Model to connect low-rank matrices, achieving comparable performance on the GLUE benchmark with only half the parameters and showing potential for longer input sequences.
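A hedged sketch of the core idea: a LoRA adapter whose low-rank activations are threaded through a simple linear state-space recurrence across successive insertion points, letting later adapters reuse computation from earlier low-rank spaces. The recurrence form and dimensions are assumptions, not the authors' exact design.

import torch
import torch.nn as nn

class SSMLoRALayer(nn.Module):
    """LoRA down/up projection with a state-space link between insertions.

    state' = A(state) + B(z), where z is this layer's low-rank activation;
    the updated state is mixed back into z before up-projection. A single
    linear recurrence stands in for the paper's SSM; treat it as a sketch.
    """
    def __init__(self, d_model=768, r=8):
        super().__init__()
        self.down = nn.Linear(d_model, r, bias=False)
        self.up = nn.Linear(r, d_model, bias=False)
        nn.init.zeros_(self.up.weight)           # standard LoRA init: no-op at start
        self.A = nn.Linear(r, r, bias=False)     # state transition
        self.B = nn.Linear(r, r, bias=False)     # input map

    def forward(self, x, state):
        z = self.down(x)                          # (batch, seq, r)
        state = self.A(state) + self.B(z)         # carry info from earlier insertions
        return self.up(z + state), state

layers = nn.ModuleList(SSMLoRALayer() for _ in range(4))
x = torch.randn(2, 16, 768)
state = torch.zeros(2, 16, 8)
for lora in layers:                               # sparser insertions still share state
    delta, state = lora(x, state)
    x = x + delta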

Authors:Wonjun Lee, Doehyeon Lee, Eugene Choi, Sangyoon Yu, Ashkan Yousefpour, Haon Park, Bumsub Ham, Suhyun Kim
Title: ELITE: Enhanced Language-Image Toxicity Evaluation for Safety
Abstract:
Current Vision Language Models (VLMs) remain vulnerable to malicious prompts that induce harmful outputs. Existing safety benchmarks for VLMs primarily rely on automated evaluation methods, but these methods struggle to detect implicit harmful content and can produce inaccurate evaluations. Indeed, we found that existing benchmarks have low levels of harmfulness, ambiguous data, and limited diversity in image-text pair combinations. To address these issues, we propose the ELITE benchmark, a high-quality safety evaluation benchmark for VLMs, underpinned by our enhanced evaluation method, the ELITE evaluator. The ELITE evaluator explicitly incorporates a toxicity score to accurately assess harmfulness in multimodal contexts, where VLMs often provide specific, convincing, but unharmful descriptions of images. We filter out ambiguous and low-quality image-text pairs from existing benchmarks using the ELITE evaluator and generate diverse combinations of safe and unsafe image-text pairs. Our experiments demonstrate that the ELITE evaluator achieves superior alignment with human evaluations compared to prior automated methods, and the ELITE benchmark offers enhanced benchmark quality and diversity. By introducing ELITE, we pave the way for safer, more robust VLMs, contributing essential tools for evaluating and mitigating safety risks in real-world applications.
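中文: ELITE是一个高质量的视觉语言模型安全评测基准,其评估器通过引入毒性评分更准确地判定多模态有害内容,过滤模糊低质图文对并生成多样化的安全与不安全组合,与人工评估的一致性优于现有自动化方法。
English: ELITE is a high-quality safety benchmark for VLMs whose toxicity-score-based evaluator aligns more closely with human judgments than prior automated methods, filtering out ambiguous image-text pairs and diversifying safe/unsafe combinations.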

Authors:Yuwei Yin, Giuseppe Carenini
Title: ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning
Abstract:
Large language models (LLMs) have demonstrated impressive capabilities on complex evaluation benchmarks, many of which are formulated as question-answering (QA) tasks. Enhancing the performance of LLMs in QA contexts is becoming increasingly vital for advancing their development and applicability. This paper introduces ARR, an intuitive, effective, and general QA solving method that explicitly incorporates three key steps: analyzing the intent of the question, retrieving relevant information, and reasoning step by step. Notably, this paper is the first to introduce intent analysis in QA, which plays a vital role in ARR. Comprehensive evaluations across 10 diverse QA tasks demonstrate that ARR consistently outperforms the baseline methods. Ablation and case studies further validate the positive contributions of each ARR component. Furthermore, experiments involving variations in prompt design indicate that ARR maintains its effectiveness regardless of the specific prompt formulation. Additionally, extensive evaluations across various model sizes, LLM series, and generation settings solidify the effectiveness, robustness, and generalizability of ARR.
中文: 本文提出的ARR方法通过意图分析、信息检索和逐步推理三大步骤,显著提升了大语言模型在问答任务中的表现,并在多种测试中展现出优于基线方法的稳定性和泛化能力。
English: This paper introduces ARR, a novel question-answering method that enhances LLM performance through intent analysis, information retrieval, and step-by-step reasoning, demonstrating consistent superiority across diverse tasks and robust generalizability.
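A sketch of an ARR-style prompt wrapper; the trigger sentence paraphrases the method's three steps (analyze intent, retrieve relevant information, reason step by step) and is not the paper's verbatim template.

ARR_TRIGGER = (
    "Answer the question. Before answering: "
    "(1) analyze the intent of the question, "
    "(2) retrieve and restate the information relevant to it, "
    "(3) then reason step by step to the final answer."
)

def arr_prompt(question, choices):
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(choices))
    return f"{ARR_TRIGGER}\n\nQuestion: {question}\n{options}\nAnswer:"

print(arr_prompt("Which gas do plants absorb?", ["Oxygen", "Carbon dioxide", "Helium"]))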

Authors:Soichiro Murakami, Peinan Zhang, Hidetaka Kamigaito, Hiroya Takamura, Manabu Okumura
Title: AdParaphrase: Paraphrase Dataset for Analyzing Linguistic Features toward Generating Attractive Ad Texts
Abstract:
Effective linguistic choices that attract potential customers play crucial roles in advertising success. This study aims to explore the linguistic features of ad texts that influence human preferences. Although the creation of attractive ad texts is an active area of research, progress in understanding the specific linguistic features that affect attractiveness is hindered by several obstacles. First, human preferences are complex and influenced by multiple factors, including their content, such as brand names, and their linguistic styles, making analysis challenging. Second, publicly available ad text datasets that include indicators of human preference, such as ad performance metrics and human feedback reflecting people's interests, are lacking. To address these problems, we present AdParaphrase, a paraphrase dataset that contains human preferences for pairs of ad texts that are semantically equivalent but differ in terms of wording and style. This dataset allows for preference analysis that focuses on the differences in linguistic features. Our analysis revealed that ad texts preferred by human judges have higher fluency, longer length, more nouns, and use of bracket symbols. Furthermore, we demonstrate that an ad text-generation model that considers these findings significantly improves the attractiveness of a given text. The dataset is publicly available at: https://github.com/CyberAgentAILab/AdParaphrase.
中文摘要:本研究推出AdParaphrase数据集,通过分析语言特征(如流畅度与措辞)对广告文本偏好的影响,证明整合这些发现能显著提升广告吸引力。
English Summary: This study introduces AdParaphrase, a dataset enabling analysis of how linguistic features like fluency and word choice influence ad text preferences, and demonstrates that incorporating these insights significantly enhances ad attractiveness.
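A sketch extracting the surface features the analysis highlights (length, noun count, bracket use), with NLTK's POS tagger standing in for whatever tooling the authors used; the fluency feature, which the paper also measures, is omitted here.

import nltk  # first run: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def ad_features(text):
    tokens = nltk.word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    return {
        "length": len(tokens),
        "num_nouns": sum(1 for _, t in tags if t.startswith("NN")),
        "has_brackets": any(ch in text for ch in "()[]【】「」"),
    }

a = "Comfortable running shoes for everyday training"
b = "Run further every day [Free shipping] with our most comfortable shoes"
print(ad_features(a), ad_features(b))  # compare feature profiles of a paraphrase pair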

Authors:Lin Tian, Emily Booth, Francesco Bailo, Julian Droogan, Marian-Andrei Rizoiu
Title: Before It's Too Late: A State Space Model for the Early Prediction of Misinformation and Disinformation Engagement
Abstract:
In today's digital age, conspiracies and information campaigns can emerge rapidly and erode social and democratic cohesion. While recent deep learning approaches have made progress in modeling engagement through language and propagation models, they struggle with irregularly sampled data and early trajectory assessment. We present IC-Mamba, a novel state space model that forecasts social media engagement by modeling interval-censored data with integrated temporal embeddings. Our model excels at predicting engagement patterns within the crucial first 15-30 minutes of posting (RMSE 0.118-0.143), enabling rapid assessment of content reach. By incorporating interval-censored modeling into the state space framework, IC-Mamba captures fine-grained temporal dynamics of engagement growth, achieving a 4.72% improvement over state-of-the-art across multiple engagement metrics (likes, shares, comments, and emojis). Our experiments demonstrate IC-Mamba's effectiveness in forecasting both post-level dynamics and broader narrative patterns (F1 0.508-0.751 for narrative-level predictions). The model maintains strong predictive performance across extended time horizons, successfully forecasting opinion-level engagement up to 28 days ahead using observation windows of 3-10 days. These capabilities enable earlier identification of potentially problematic content, providing crucial lead time for designing and implementing countermeasures. Code is available at: https://github.com/ltian678/ic-mamba. An interactive dashboard demonstrating our results is available at: https://ic-mamba.behavioral-ds.science.
中文: IC-Mamba是一种新颖的状态空间模型,通过结合时间嵌入处理区间删失数据,能有效预测社交媒体参与度,在早期轨迹评估和长期预测方面均表现优异。
English: IC-Mamba is a novel state space model that effectively forecasts social media engagement by modeling interval-censored data with temporal embeddings, achieving superior performance in early trajectory assessment and extended time horizon predictions.

Authors:Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin
Title: Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools
Abstract:
We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Agentic Reasoning dynamically leverages web search, code execution, and structured memory to address complex problems requiring deep research. A key innovation in our framework is the Mind-Map agent, which constructs a structured knowledge graph to store reasoning context and track logical relationships, ensuring coherence in long reasoning chains with extensive tool usage. Additionally, we conduct a comprehensive exploration of the Web-Search agent, leading to a highly effective search mechanism that surpasses all prior approaches. When deployed on DeepSeek-R1, our method achieves a new state-of-the-art (SOTA) among public models and delivers performance comparable to OpenAI Deep Research, the leading proprietary model in this domain. Extensive ablation studies validate the optimal selection of agentic tools and confirm the effectiveness of our Mind-Map and Web-Search agents in enhancing LLM reasoning. The code is at: https://github.com/theworldofagents/Agentic-Reasoning
Chinese: Agentic Reasoning框架通过整合网络搜索、代码执行和思维导图代理等动态工具,增强了大型语言模型的推理能力,其性能达到了与领先专有模型相媲美的顶尖水平。
English: The Agentic Reasoning framework enhances large language model reasoning by integrating dynamic tools like web search, code execution, and a Mind-Map agent for structured knowledge tracking, achieving state-of-the-art performance comparable to leading proprietary models.
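A minimal sketch of the Mind-Map idea, a growing knowledge graph of reasoning context, using networkx; the node and edge schema is an assumption, not the paper's implementation.

import networkx as nx

class MindMap:
    """Structured store for reasoning context: claims as nodes, relations as edges."""
    def __init__(self):
        self.g = nx.DiGraph()

    def add(self, claim, supports=(), relation="supports"):
        self.g.add_node(claim)
        for premise in supports:
            self.g.add_edge(premise, claim, relation=relation)

    def context_for(self, claim):
        """Premise chain to splice back into the LLM prompt, keeping long
        tool-heavy reasoning chains coherent."""
        return list(nx.ancestors(self.g, claim))

mm = MindMap()
mm.add("compound X is toxic")
mm.add("dose above 5mg is unsafe", supports=["compound X is toxic"])
print(mm.context_for("dose above 5mg is unsafe"))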

Authors:Brian Formento, Chuan Sheng Foo, See-Kiong Ng
Title: Confidence Elicitation: A New Attack Vector for Large Language Models
Abstract:
A fundamental issue in deep learning has been adversarial robustness. As these systems have scaled, such issues have persisted. Currently, large language models (LLMs) with billions of parameters suffer from adversarial attacks just like their earlier, smaller counterparts. However, the threat models have changed. Previously, having gray-box access, where input embeddings or output logits/probabilities were visible to the user, might have been reasonable. However, with the introduction of closed-source models, no information about the model is available apart from the generated output. This means that current black-box attacks can only utilize the final prediction to detect if an attack is successful. In this work, we investigate and demonstrate the potential of attack guidance, akin to using output probabilities, while having only black-box access in a classification setting. This is achieved through the ability to elicit confidence from the model. We empirically show that the elicited confidence is calibrated and not hallucinated for current LLMs. By minimizing the elicited confidence, we can therefore increase the likelihood of misclassification. Our new proposed paradigm demonstrates promising state-of-the-art results on three datasets across two models (LLaMA-3-8B-Instruct and Mistral-7B-Instruct-V0.3) when comparing our technique to existing hard-label black-box attack methods that introduce word-level substitutions.
中文: 本研究提出了一种新型黑盒攻击方法,通过从大语言模型中获取校准后的置信度分数来指导对抗性攻击,在无需模型内部信息的情况下仅通过词级替换就实现了最先进的误分类效果。
English: This study introduces a novel black-box attack method that guides adversarial attacks by eliciting calibrated confidence scores from large language models, achieving state-of-the-art misclassification rates through word-level substitutions without accessing internal model information.
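A schematic of the attack loop: elicit a verbalized confidence for the true label, then greedily keep the word substitution that lowers it most. query_model and synonyms are placeholders for the target LLM interface and the substitution source.

def attack(text, label, query_model, synonyms, max_swaps=5):
    """Greedy hard-label attack guided by elicited confidence.

    query_model(text, label) -> (predicted_label, verbalized confidence in [0, 1]);
    synonyms(word) -> candidate replacements. Both are stand-ins for real components.
    """
    words = text.split()
    _, best_conf = query_model(text, label)
    for _ in range(max_swaps):
        best_swap = None
        for i, w in enumerate(words):
            for cand in synonyms(w):
                trial = " ".join(words[:i] + [cand] + words[i + 1:])
                pred, conf = query_model(trial, label)
                if pred != label:
                    return trial                 # misclassification achieved
                if conf < best_conf:             # lower confidence = progress
                    best_conf, best_swap = conf, (i, cand)
        if best_swap is None:
            break
        i, cand = best_swap
        words[i] = cand
    return " ".join(words)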

Authors:Sandra C. Sandoval, Christabel Acquaye, Kwesi Cobbina, Mohammad Nayeem Teli, Hal Daumé
Title: My LLM might Mimic AAE -- But When Should it?
Abstract:
We examine the representation of African American English (AAE) in large language models (LLMs), exploring (a) the perceptions Black Americans have of how effective these technologies are at producing authentic AAE, and (b) in what contexts Black Americans find this desirable. Through both a survey of Black Americans (n = 104) and annotation of LLM-produced AAE by Black Americans (n = 228), we find that Black Americans favor choice and autonomy in determining when AAE is appropriate in LLM output. They tend to prefer that LLMs default to communicating in Mainstream U.S. English in formal settings, with greater interest in AAE production in less formal settings. When LLMs were appropriately prompted and provided in-context examples, our participants found their outputs to have a level of AAE authenticity on par with transcripts of Black American speech. Select code and data for our project can be found here: https://github.com/smelliecat/AAEMime.git
中文摘要:本研究探讨了美国黑人对大型语言模型中非裔美国人英语(AAE)真实性与适用性的看法,发现他们更倾向于自主决定AAE的使用场景——在正式场合偏好主流英语,非正式场合则接受AAE输出,且经恰当提示的模型生成的AAE真实性可与真人语音媲美。
English Summary: The study investigates Black Americans' views on the authenticity and desirability of African American English (AAE) in large language models, finding they prefer autonomy in choosing AAE usage—favoring it in informal contexts while preferring mainstream English in formal settings, with appropriately prompted models achieving speech authenticity comparable to human transcripts.

Authors:Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Zachary Yahn, Ling Liu
Title: Multi-Agent Reinforcement Learning with Focal Diversity Optimization
Abstract:
The advancement of Large Language Models (LLMs) and their finetuning strategies has triggered the renewed interests in multi-agent reinforcement learning. In this paper, we introduce a focal diversity-optimized multi-agent reinforcement learning approach, coined as MARL-Focal, with three unique characteristics. First, we develop an agent-fusion framework for encouraging multiple LLM based agents to collaborate in producing the final inference output for each LLM query. Second, we develop a focal-diversity optimized agent selection algorithm that can choose a small subset of the available agents based on how well they can complement one another to generate the query output. Finally, we design a conflict-resolution method to detect output inconsistency among multiple agents and produce our MARL-Focal output through reward-aware and policy-adaptive inference fusion. Extensive evaluations on five benchmarks show that MARL-Focal is cost-efficient and adversarial-robust. Our multi-agent fusion model achieves performance improvement of 5.51\% compared to the best individual LLM-agent and offers stronger robustness over the TruthfulQA benchmark. Code is available at https://github.com/sftekin/rl-focal
中文摘要:本文提出的MARL-Focal方法通过多智能体协作框架、聚焦多样性优化选择机制和冲突解决方案,显著提升大语言模型的性能表现与抗干扰能力,在多个基准测试中展现出优越效果。
English Summary: This paper introduces MARL-Focal, a diversity-optimized multi-agent reinforcement learning approach that enhances LLM performance through agent collaboration, intelligent selection, and conflict resolution, achieving significant efficiency and robustness gains.
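The paper's focal diversity metric is not reproduced here; this sketch approximates the selection step with a greedy complementarity heuristic over validation outcomes, choosing agents that fix what the current ensemble still misses.

def select_agents(correct, k=3):
    """correct[a] = list of 0/1 flags: did agent `a` answer validation item i correctly.

    Greedily add the agent that covers the most items the ensemble still misses,
    a simple complementarity proxy for focal-diversity-optimized selection.
    """
    agents = list(correct)
    n = len(next(iter(correct.values())))
    chosen, covered = [], [0] * n
    for _ in range(k):
        def gain(a):
            return sum(1 for i in range(n) if not covered[i] and correct[a][i])
        best = max((a for a in agents if a not in chosen), key=gain)
        chosen.append(best)
        covered = [c or correct[best][i] for i, c in enumerate(covered)]
    return chosen

pool = {"A": [1, 0, 0, 1], "B": [0, 1, 0, 1], "C": [0, 0, 1, 0]}
print(select_agents(pool, k=2))  # picks a complementary pair, e.g. ['A', 'B']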

Authors:He Hu, Yucheng Zhou, Lianzhong You, Hongbo Xu, Qianning Wang, Zheng Lian, Fei Richard Yu, Fei Ma, Laizhong Cui
Title: EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models
Abstract:
With the integration of Multimodal large language models (MLLMs) into robotic systems and various AI applications, embedding emotional intelligence (EI) capabilities into these models is essential for enabling robots to effectively address human emotional needs and interact seamlessly in real-world scenarios. Existing static, text-based, or text-image benchmarks overlook the multimodal complexities of real-world interactions and fail to capture the dynamic, multimodal nature of emotional expressions, making them inadequate for evaluating MLLMs' EI. Based on established psychological theories of EI, we build EmoBench-M, a novel benchmark designed to evaluate the EI capability of MLLMs across 13 evaluation scenarios from three key dimensions: foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis. Evaluations of both open-source and closed-source MLLMs on EmoBench-M reveal a significant performance gap between them and humans, highlighting the need to further advance their EI capabilities. All benchmark resources, including code and datasets, are publicly available at https://emo-gml.github.io/.
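中文: EmoBench-M是一个基于心理学情商理论构建的多模态基准,从基础情绪识别、对话情绪理解和复杂社会情绪分析三个维度评测多模态大模型,结果显示其与人类水平仍存在显著差距。
English: EmoBench-M is a psychology-grounded benchmark that evaluates the emotional intelligence of multimodal LLMs across foundational emotion recognition, conversational emotion understanding, and socially complex emotion analysis, revealing a significant gap between current models and humans.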

Authors:Xing Li, Zeyu Xing, Yiming Li, Linping Qu, Hui-Ling Zhen, Wulong Liu, Yiwu Yao, Sinno Jialin Pan, Mingxuan Yuan
Title: KVTuner: Sensitivity-Aware Layer-Wise Mixed-Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference
Abstract:
KV cache quantization can improve Large Language Model (LLM) inference throughput and latency in long contexts and large batch-size scenarios while preserving LLM effectiveness. However, current methods have three unsolved issues: overlooking layer-wise sensitivity to KV cache quantization, high overhead of online fine-grained decision-making, and low flexibility to different LLMs and constraints. Therefore, we theoretically analyze the inherent correlation of layer-wise transformer attention patterns to KV cache quantization errors and study why key cache is generally more important than value cache for quantization error reduction. We further propose a simple yet effective framework KVTuner to adaptively search for the optimal hardware-friendly layer-wise KV quantization precision pairs for coarse-grained KV cache with multi-objective optimization and directly utilize the offline searched configurations during online inference. To reduce the computational cost of offline calibration, we utilize the intra-layer KV precision pair pruning and inter-layer clustering to reduce the search space. Experimental results show that we can achieve nearly lossless 3.25-bit mixed precision KV cache quantization for LLMs like Llama-3.1-8B-Instruct and 4.0-bit for sensitive models like Qwen2.5-7B-Instruct on mathematical reasoning tasks. The maximum inference throughput can be improved by 21.25\% compared with KIVI-KV8 quantization over various context lengths. Our code and searched configurations are available at https://github.com/cmd2001/KVTuner.
中文摘要:KV缓存量化可提升大语言模型在长上下文和大批量场景下的推理效率,而KVTuner框架通过自适应优化分层量化精度,实现了近乎无损的性能和显著的吞吐量提升。
English Summary: KV cache quantization enhances LLM inference efficiency in long contexts and large batches, and the proposed KVTuner framework adaptively optimizes layer-wise quantization precision to achieve nearly lossless performance with significant throughput improvements.
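A sketch of the mechanism being tuned: symmetric uniform quantization of each layer's KV cache at a per-layer (key bits, value bits) pair. The pairs below are made up; KVTuner searches them offline under multi-objective constraints.

import torch

def quantize(t, bits):
    """Symmetric per-tensor uniform quantization (dequantized here for clarity)."""
    qmax = 2 ** (bits - 1) - 1
    scale = t.abs().amax().clamp(min=1e-8) / qmax
    return (t / scale).round().clamp(-qmax - 1, qmax) * scale

def compress_kv(kv_cache, precision_pairs):
    """kv_cache: list of (K, V) per layer; precision_pairs: offline-searched bit pairs."""
    return [(quantize(k, kb), quantize(v, vb))
            for (k, v), (kb, vb) in zip(kv_cache, precision_pairs)]

kv = [(torch.randn(8, 128, 64), torch.randn(8, 128, 64)) for _ in range(4)]
pairs = [(4, 2), (4, 4), (2, 2), (4, 2)]  # keys usually warrant more bits than values
out = compress_kv(kv, pairs)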

Authors:Xuejiao Zhao, Siyan Liu, Su-Yin Yang, Chunyan Miao
Title: MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot
Abstract:
Retrieval-augmented generation (RAG) is a well-suited technique for retrieving privacy-sensitive Electronic Health Records (EHR). It can serve as a key module of the healthcare copilot, helping reduce misdiagnosis for healthcare practitioners and patients. However, the diagnostic accuracy and specificity of existing heuristic-based RAG models used in the medical domain are inadequate, particularly for diseases with similar manifestations. This paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited reasoning for the medical domain that retrieves diagnosis and treatment recommendations based on manifestations. MedRAG systematically constructs a comprehensive four-tier hierarchical diagnostic KG encompassing critical diagnostic differences of various diseases. These differences are dynamically integrated with similar EHRs retrieved from an EHR database, and reasoned within a large language model. This process enables more accurate and specific decision support, while also proactively providing follow-up questions to enhance personalized medical decision-making. MedRAG is evaluated on both a public dataset DDXPlus and a private chronic pain diagnostic dataset (CPDD) collected from Tan Tock Seng Hospital, and its performance is compared against various existing RAG methods. Experimental results show that, leveraging the information integration and relational abilities of the KG, our MedRAG provides more specific diagnostic insights and outperforms state-of-the-art models in reducing misdiagnosis rates. Our code will be available at https://github.com/SNOWTEAM2023/MedRAG
中文:MedRAG通过知识图谱推理增强检索生成技术,结合分层诊断知识与电子健康记录,显著提升医疗诊断的准确性和特异性,在降低误诊率方面优于现有方法。
English: MedRAG enhances retrieval-augmented generation with knowledge graph reasoning to improve diagnostic accuracy and specificity in healthcare by integrating hierarchical diagnostic knowledge with electronic health records, outperforming existing methods in reducing misdiagnosis.
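A schematic of the prompt-assembly step, combining retrieved similar EHRs with KG-elicited diagnostic differences for LLM reasoning; the retriever, the four-tier KG, and the example strings are all stand-ins.

def build_medrag_prompt(manifestations, similar_ehrs, kg_differences):
    """Combine patient manifestations, retrieved EHRs, and KG-elicited
    critical diagnostic differences into one reasoning prompt."""
    ehr_block = "\n".join(f"- {e}" for e in similar_ehrs)
    diff_block = "\n".join(f"- {d}" for d in kg_differences)
    return (
        f"Patient manifestations: {manifestations}\n"
        f"Similar past cases:\n{ehr_block}\n"
        f"Critical features distinguishing the candidate diagnoses:\n{diff_block}\n"
        "Give the most specific diagnosis, and ask a follow-up question "
        "if a distinguishing feature is still unknown."
    )

print(build_medrag_prompt(
    "chronic lower back pain radiating to the left leg",
    ["case 112: sciatica, positive straight-leg raise"],
    ["radicular pain below the knee favors lumbar radiculopathy over facet syndrome"],
))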

Authors:Long Chen, Xiaotian Song, Andy Song, BaDong Chen, Jiancheng Lv, Yanan Sun
Title: FAS: Fast ANN-SNN Conversion for Spiking Large Language Models
Abstract:
Spiking Large Language Models have been shown to be a good alternative to LLMs in various scenarios. Existing methods for creating Spiking LLMs, i.e., direct training and ANN-SNN conversion, often suffer from performance degradation and relatively high computational costs. To address these issues, we propose a novel Fast ANN-SNN conversion strategy (FAS) that transforms LLMs into spiking LLMs in two stages. The first stage employs a full-parameter fine-tuning of pre-trained models, so it does not need any direct training from scratch. The second stage introduces a coarse-to-fine calibration method to reduce conversion errors and improve accuracy. Experiments on both language and vision-language tasks across four different scales of LLMs demonstrate that FAS can achieve state-of-the-art performance yet with significantly reduced inference latency and computational costs. Notably, FAS only takes eight timesteps to achieve an accuracy 3\% higher than that of the OPT-7B model, while reducing energy consumption by 96.63\%. The source code is available at https://github.com/lc783/FAS
中文: 本文提出了一种新颖的快速人工神经网络-脉冲神经网络转换策略(FAS),通过两阶段微调和校准将大语言模型转换为脉冲神经网络,在显著降低延迟和能耗的同时实现了最先进的性能。
English: This paper introduces a novel Fast ANN-SNN conversion strategy (FAS) that transforms large language models into spiking neural networks through two-stage fine-tuning and calibration, achieving state-of-the-art performance with significantly reduced latency and energy consumption.

Authors:Royson Lee, Minyoung Kim, Fady Rezk, Rui Li, Stylianos I. Venieris, Timothy Hospedales
Title: FedP$^2$EFT: Federated Learning to Personalize PEFT for Multilingual LLMs
Abstract:
Federated learning (FL) has enabled the training of multilingual large language models (LLMs) on diverse and decentralized multilingual data, especially on low-resource languages. To improve client-specific performance, personalization via the use of parameter-efficient fine-tuning (PEFT) modules such as LoRA is common. This involves a personalization strategy (PS), such as the design of the PEFT adapter structures (e.g., in which layers to add LoRAs and what ranks) and choice of hyperparameters (e.g., learning rates) for fine-tuning. Instead of manual PS configuration, we propose FedP$^2$EFT, a federated learning-to-personalize method for multilingual LLMs in cross-device FL settings. Unlike most existing PEFT structure selection methods, which are prone to overfitting low-data regimes, FedP$^2$EFT collaboratively learns the optimal personalized PEFT structure for each client via Bayesian sparse rank selection. Evaluations on both simulated and real-world multilingual FL benchmarks demonstrate that FedP$^2$EFT largely outperforms existing personalized fine-tuning methods, while complementing other existing FL methods. Code is available at https://github.com/SamsungLabs/fedp2eft.
中文:FedP$^2$EFT提出了一种联邦学习方法,通过贝叶斯稀疏秩选择协同优化多语言大模型的个性化高效参数微调结构,在跨设备联邦学习基准测试中显著优于现有方法。
English: FedP$^2$EFT introduces a federated learning method that collaboratively optimizes personalized parameter-efficient fine-tuning structures for multilingual LLMs using Bayesian sparse rank selection, significantly outperforming existing approaches on cross-device FL benchmarks.

Authors:Rajiv Movva, Kenny Peng, Nikhil Garg, Jon Kleinberg, Emma Pierson
Title: Sparse Autoencoders for Hypothesis Generation
Abstract:
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., "mentions being surprised or shocked") using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
中文: HypotheSAEs 是一种计算效率高的方法,通过稀疏自编码器和大型语言模型生成可解释的文本特征假设,用于预测目标变量,在合成和真实数据集上均比基线方法更准确地识别假设并产生更多新发现。
English: HypotheSAEs is a computationally efficient method that uses sparse autoencoders and LLMs to generate interpretable hypotheses about text features predicting target variables, outperforming baselines in accuracy and discovery on both synthetic and real datasets.
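A compact sketch of steps (1)-(2): a top-k sparse autoencoder on text embeddings, then feature selection by correlation with the target. Step (3), LLM interpretation of top-activating examples, is only indicated in a comment; sizes and the sparsity mechanism are illustrative.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_emb=384, d_feat=2048, k=16):
        super().__init__()
        self.enc = nn.Linear(d_emb, d_feat)
        self.dec = nn.Linear(d_feat, d_emb)
        self.k = k                                  # keep top-k activations (sparsity)

    def forward(self, x):
        a = torch.relu(self.enc(x))
        topk = torch.topk(a, self.k, dim=-1)
        sparse = torch.zeros_like(a).scatter_(-1, topk.indices, topk.values)
        return self.dec(sparse), sparse

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
emb = torch.randn(1024, 384)                        # text embeddings (placeholder)
y = torch.randint(0, 2, (1024,)).float()            # target, e.g. clicked or not
for _ in range(100):                                # step 1: reconstruction training
    recon, feats = sae(emb)
    loss = ((recon - emb) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Step 2: rank features by correlation with the target. Step 3 would then ask
# an LLM to interpret top-activating examples of each selected feature.
with torch.no_grad():
    _, feats = sae(emb)
    fc = feats - feats.mean(0)
    yc = y - y.mean()
    corr = (fc * yc[:, None]).mean(0) / (fc.std(0) * yc.std() + 1e-8)
    top_features = corr.abs().topk(5).indices.tolist()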

Authors:Afshin Khadangi, Amir Sartipi, Igor Tchappi, Gilbert Fridgen
Title: CognArtive: Large Language Models for Automating Art Analysis and Decoding Aesthetic Elements
Abstract:
Art, as a universal language, can be interpreted in diverse ways, with artworks embodying profound meanings and nuances. The advent of Large Language Models (LLMs) and the availability of Multimodal Large Language Models (MLLMs) raise the question of how these transformative models can be used to assess and interpret the artistic elements of artworks. While research has been conducted in this domain, to the best of our knowledge, a deep and detailed understanding of the technical and expressive features of artworks using LLMs has not been explored. In this study, we investigate the automation of a formal art analysis framework to analyze a high-throughput number of artworks rapidly and examine how their patterns evolve over time. We explore how LLMs can decode artistic expressions, visual elements, composition, and techniques, revealing emerging patterns that develop across periods. Finally, we discuss the strengths and limitations of LLMs in this context, emphasizing their ability to process vast quantities of art-related data and generate insightful interpretations. Due to the exhaustive and granular nature of the results, we have developed interactive data visualizations, available online https://cognartive.github.io/, to enhance understanding and accessibility.
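中文: 本研究利用大语言模型与多模态大模型自动化执行形式化艺术分析框架,高通量解析艺术作品的技法与表达元素及其随时代演变的模式,并以交互式可视化呈现结果。
English: This study automates a formal art analysis framework with LLMs and MLLMs to rapidly analyze artworks' technical and expressive features, trace how patterns evolve across periods, and present the granular results through interactive visualizations.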

Authors:Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan
Title: CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance
Abstract:
Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLM OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.
中文: CodeSteer是一种创新方法,能有效引导大型语言模型在文本推理与代码生成间切换,显著提升其在各类任务中的符号计算性能。
English: CodeSteer is an innovative method that enhances LLMs' ability to switch between textual reasoning and code generation, significantly boosting their symbolic computing performance across diverse tasks.

Authors:Juyun Wee, Minjae Park, Jaeho Lee
Title: Prompt-based Depth Pruning of Large Language Models
Abstract:
Depth pruning aims to reduce the inference cost of a large language model without any hardware-specific complications, by simply removing several less important transformer blocks. However, our empirical findings suggest that the importance of a transformer block may be highly task-dependent -- a block that is crucial for a task can be removed without degrading the accuracy on another task. Based on this observation, we develop a dynamic depth pruning algorithm, coined PuDDing (Prompt-routed Dynamic Depth Pruning), which determines which blocks to omit from the model based on the input prompt. PuDDing operates by training a lightweight router to predict the best omission set among a set of options, where this option set has also been constructed in a data-driven manner. Empirical results on commonsense reasoning benchmarks demonstrate that PuDDing effectively accelerates the inference of language models, and achieves better on-task performance than static depth pruning baselines.
Chinese: 深度剪枝通过移除不太重要的Transformer模块来降低大型语言模型的推理成本,而新提出的PuDDing方法根据输入提示动态调整模块剪枝策略,相比静态方法在任务特定性能和效率上表现更优。
English: Depth pruning reduces large language model inference costs by removing less important transformer blocks, but a new dynamic method called PuDDing adapts block removal based on input prompts for better task-specific performance and efficiency than static approaches.
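A sketch of prompt-routed depth pruning: a lightweight router maps a prompt embedding to one of several precomputed omission sets, and the listed transformer blocks are skipped at inference. The omission sets and router shape are invented for illustration; the paper constructs the option set in a data-driven way.

import torch
import torch.nn as nn

OMISSION_SETS = [{22, 25, 27}, {18, 21, 29}, set()]   # illustrative, not the paper's

class Router(nn.Module):
    """Tiny classifier: prompt embedding -> index of the best omission set."""
    def __init__(self, d_emb=768, n_sets=len(OMISSION_SETS)):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_emb, 128), nn.ReLU(), nn.Linear(128, n_sets))

    def forward(self, prompt_emb):
        return self.mlp(prompt_emb).argmax(-1)

def forward_pruned(blocks, h, omit):
    for i, block in enumerate(blocks):
        if i not in omit:            # skip blocks deemed unimportant for this prompt
            h = block(h)
    return h

router = Router()
omit = OMISSION_SETS[router(torch.randn(768)).item()]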

Authors:Saydul Akbar Murad, Ashim Dahal, Nick Rahimi
Title: Multi-Lingual Cyber Threat Detection in Tweets/X Using ML, DL, and LLM: A Comparative Analysis
Abstract:
Cyber threat detection has become an important area of focus in today's digital age due to the growing spread of fake information and harmful content on social media platforms such as Twitter (now 'X'). These cyber threats, often disguised within tweets, pose significant risks to individuals, communities, and even nations, emphasizing the need for effective detection systems. While previous research has explored tweet-based threats, much of the work is limited to specific languages, domains, or locations, or relies on single-model approaches, reducing their applicability to diverse real-world scenarios. To address these gaps, our study focuses on multi-lingual tweet cyber threat detection using a variety of advanced models. The research was conducted in three stages: (1) We collected and labeled tweet datasets in four languages: English, Chinese, Russian, and Arabic, employing both manual and polarity-based labeling methods to ensure high-quality annotations. (2) Each dataset was analyzed individually using machine learning (ML) and deep learning (DL) models to assess their performance on distinct languages. (3) Finally, we combined all four datasets into a single multi-lingual dataset and applied DL and large language model (LLM) architectures to evaluate their efficacy in identifying cyber threats across various languages. Our results show that among machine learning models, Random Forest (RF) attained the highest performance; however, the Bi-LSTM architecture consistently surpassed other DL and LLM architectures across all datasets. These findings underline the effectiveness of Bi-LSTM in multilingual cyber threat detection. The code for this paper can be found at this link: https://github.com/Mmurrad/Tweet-Data-Classification.git.
中文: 本研究通过结合多种机器学习与深度学习模型,开发了针对多语言推文的网络威胁检测方法,其中双向长短期记忆网络(Bi-LSTM)在所有语言数据集中均表现出最优性能。
English: This study addresses the limitations of previous cyber threat detection methods on Twitter by developing a multilingual approach using machine learning and deep learning models, with Bi-LSTM emerging as the most effective across diverse languages.
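A minimal PyTorch version of the kind of Bi-LSTM classifier the comparison found strongest; vocabulary size, dimensions, and mean-pooling are illustrative choices, not the paper's exact configuration.

import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Token IDs -> embedding -> bidirectional LSTM -> threat/no-threat logits."""
    def __init__(self, vocab_size=50_000, d_emb=128, d_hid=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_emb, padding_idx=0)
        self.lstm = nn.LSTM(d_emb, d_hid, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * d_hid, n_classes)

    def forward(self, ids):
        h, _ = self.lstm(self.emb(ids))
        return self.head(h.mean(dim=1))    # mean-pool over tokens

model = BiLSTMClassifier()
logits = model(torch.randint(1, 50_000, (4, 64)))  # 4 tweets, 64 tokens each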

Authors:Zuyan Liu, Yuhao Dong, Jiahui Wang, Ziwei Liu, Winston Hu, Jiwen Lu, Yongming Rao
Title: Ola: Pushing the Frontiers of Omni-Modal Language Model
Abstract:
Recent advances in large language models, particularly following GPT-4o, have sparked increasing interest in developing omni-modal models capable of understanding more modalities. While some open-source alternatives have emerged, there is still a notable lag behind specialized single-modality models in performance. In this paper, we present Ola, an Omni-modal Language model that achieves competitive performance across image, video, and audio understanding compared to specialized counterparts, pushing the frontiers of the omni-modal language model to a large extent. We conduct a comprehensive exploration of architectural design, data curation, and training strategies essential for building a robust omni-modal model. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements over mainstream baselines. Moreover, we rethink inter-modal relationships during omni-modal training, emphasizing cross-modal alignment with video as a central bridge, and propose a progressive training pipeline that begins with the most distinct modalities and gradually moves towards closer modality alignment. Extensive experiments demonstrate that Ola surpasses existing open omni-modal LLMs across all modalities while achieving highly competitive performance compared to state-of-the-art specialized models of similar sizes. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field. Model weights, code, and data are open-sourced at https://github.com/Ola-Omni/Ola.
中文摘要:本文介绍了Ola全模态语言模型,通过架构改进和渐进式训练方法,在图像、视频和音频理解方面实现了与专业模型相媲美的性能,并完全开源以推动该领域未来发展。
English summary: The paper introduces Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding through architectural improvements and a progressive training approach, while being fully open-sourced to advance future research.

Authors:Yinjie Wang, Ling Yang, Guohao Li, Mengdi Wang, Bryon Aragam
Title: ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization
Abstract:
Recent research has leveraged large language model multi-agent systems for complex problem-solving while trying to reduce the manual effort required to build them, driving the development of automated agent workflow optimization methods. However, existing methods remain inflexible due to representational limitations, a lack of adaptability, and poor scalability when relying on discrete optimization techniques. We address these challenges with ScoreFlow, a simple yet high-performance framework that leverages efficient gradient-based optimization in a continuous space. ScoreFlow incorporates Score-DPO, a novel variant of the direct preference optimization method that accounts for quantitative feedback. Across six benchmarks spanning question answering, coding, and mathematical reasoning, ScoreFlow achieves an 8.2% improvement over existing baselines. Moreover, it empowers smaller models to outperform larger ones with lower inference costs. Project: https://github.com/Gen-Verse/ScoreFlow
中文:ScoreFlow采用基于梯度的优化框架和创新的Score-DPO方法,显著提升了自动化智能体工作流程的效率,在多项基准测试中性能提高8.2%,并让小模型以更低成本超越大模型表现。
English: ScoreFlow introduces a gradient-based optimization framework with a novel Score-DPO method to enhance automated agent workflow efficiency, achieving an 8.2% performance boost across benchmarks and enabling smaller models to outperform larger ones cost-effectively.
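The exact Score-DPO formulation is not given in the abstract; the sketch below shows one plausible reading, standard DPO with the preference term weighted by the quantitative score gap between the two workflows. Treat the weighting as an assumption.

import torch
import torch.nn.functional as F

def score_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, score_w, score_l, beta=0.1):
    """DPO with the implicit-reward margin weighted by the score gap.

    logp_* : policy log-probs of the preferred (w) and dispreferred (l) outputs;
    ref_logp_* : same under the frozen reference; score_* : quantitative feedback.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    weight = (score_w - score_l).clamp(min=0)       # larger score gaps push harder
    return -(weight * F.logsigmoid(margin)).mean()

loss = score_dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                      torch.tensor([-13.0]), torch.tensor([-14.0]),
                      torch.tensor([0.9]), torch.tensor([0.4]))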

Authors:Yuanye Liu, Jiahang Xu, Li Lyna Zhang, Qi Chen, Xuan Feng, Yang Chen, Zhongxin Guo, Yuqing Yang, Peng Cheng
Title: Beyond Prompt Content: Enhancing LLM Performance via Content-Format Integrated Prompt Optimization
Abstract:
Large Language Models (LLMs) have shown significant capability across various tasks, with their real-world effectiveness often driven by prompt design. While recent research has focused on optimizing prompt content, the role of prompt formatting, a critical but often overlooked dimension, has received limited systematic investigation. In this paper, we introduce Content-Format Integrated Prompt Optimization (CFPO), an innovative methodology that jointly optimizes both prompt content and formatting through an iterative refinement process. CFPO leverages natural language mutations to explore content variations and employs a dynamic format exploration strategy that systematically evaluates diverse format options. Our extensive evaluations across multiple tasks and open-source LLMs demonstrate that CFPO delivers measurable performance improvements compared to content-only optimization methods. This highlights the importance of integrated content-format optimization and offers a practical, model-agnostic approach to enhancing LLM performance. Code is available at https://github.com/HenryLau7/CFPO.
Chinese: CFPO是一种通过迭代优化同时改进提示内容和格式的新方法,在多项任务和开源大语言模型上相比仅优化内容的方法均展现出显著性能提升。
English: CFPO is a novel method that jointly optimizes both prompt content and formatting through iterative refinement, demonstrating measurable performance improvements over content-only optimization across various tasks and LLMs.
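A schematic of the joint search loop: each round mutates prompt content (LLM-driven in the paper) and sweeps a format pool, keeping the best (content, format) pair by dev-set score. All callables below are stubs.

def cfpo(seed_content, formats, mutate, render, evaluate, rounds=5):
    """mutate(content) -> list of content variants (an LLM call in the paper);
    render(content, fmt) -> full prompt; evaluate(prompt) -> dev-set score."""
    best_c, best_f = seed_content, formats[0]
    best_score = evaluate(render(best_c, best_f))
    for _ in range(rounds):
        for c in [best_c] + mutate(best_c):
            for f in formats:                    # dynamic format exploration, simplified
                s = evaluate(render(c, f))
                if s > best_score:
                    best_c, best_f, best_score = c, f, s
    return best_c, best_f, best_score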

Authors:Kaikai An, Li Sheng, Ganqu Cui, Shuzheng Si, Ning Ding, Yu Cheng, Baobao Chang
Title: UltraIF: Advancing Instruction Following from the Wild
Abstract:
Instruction-following made modern large language models (LLMs) helpful assistants. However, the key to taming LLMs on complex instructions remains elusive, as there are large gaps between models trained by the open-source community and those trained by leading companies. To bridge the gap, we propose a simple and scalable approach UltraIF for building LLMs that can follow complex instructions with open-source data. UltraIF first decomposes real-world user prompts into simpler queries, constraints, and corresponding evaluation questions for the constraints. Then, we train an UltraComposer to compose constraint-associated prompts with evaluation questions. This prompt composer allows us to synthesize complicated instructions as well as filter responses with evaluation questions. In our experiment, for the first time, we successfully align LLaMA-3.1-8B-Base to catch up with its instruct version on 5 instruction-following benchmarks without any benchmark information, using only an 8B model as response generator and evaluator. The aligned model also achieved competitive scores on other benchmarks. Moreover, we also show that UltraIF could further improve LLaMA-3.1-8B-Instruct through self-alignment, motivating broader use cases for the method. Our code will be available at https://github.com/kkk-an/UltraIF.
中文摘要:UltraIF是一种通过将复杂指令分解为简单组件并训练合成器来生成和评估指令的可扩展方法,有效缩小了开源与领先大语言模型之间的性能差距,成功使基础模型在多个基准测试中达到指导版本的同等水平。
English Summary: UltraIF is a scalable method that bridges the performance gap between open-source and leading LLMs by decomposing complex instructions into simpler components and training a composer to synthesize and evaluate them, successfully aligning base models to match their instruct versions on benchmarks.

Authors:Longquan Jiang, Junbo Huang, Cedric Möller, Ricardo Usbeck
Title: Ontology-Guided, Hybrid Prompt Learning for Generalization in Knowledge Graph Question Answering
Abstract:
Most existing Knowledge Graph Question Answering (KGQA) approaches are designed for a specific KG, such as Wikidata, DBpedia or Freebase. Due to the heterogeneity of the underlying graph schema, topology and assertions, most KGQA systems cannot be transferred to unseen Knowledge Graphs (KGs) without resource-intensive training data. We present OntoSCPrompt, a novel Large Language Model (LLM)-based KGQA approach with a two-stage architecture that separates semantic parsing from KG-dependent interactions. OntoSCPrompt first generates a SPARQL query structure (including SPARQL keywords such as SELECT, ASK, WHERE and placeholders for missing tokens) and then fills them with KG-specific information. To enhance the understanding of the underlying KG, we present an ontology-guided, hybrid prompt learning strategy that integrates KG ontology into the learning process of hybrid prompts (e.g., discrete and continuous vectors). We also present several task-specific decoding strategies to ensure the correctness and executability of generated SPARQL queries in both stages. Experimental results demonstrate that OntoSCPrompt performs as well as SOTA approaches without retraining on a number of KGQA datasets such as CWQ, WebQSP and LC-QuAD 1.0 in a resource-efficient manner and can generalize well to unseen domain-specific KGs like DBLP-QuAD and CoyPu KG. Code: https://github.com/LongquanJiang/OntoSCPrompt
中文: OntoSCPrompt提出了一种新颖的两阶段大语言模型知识图谱问答方法,通过分离语义解析与图谱交互,并采用本体引导的混合提示学习策略,无需重新训练即可高效泛化到未见过的知识图谱。
English: OntoSCPrompt introduces a novel two-stage LLM-based KGQA approach that separates semantic parsing from KG interactions, enabling efficient generalization to unseen knowledge graphs without retraining through ontology-guided prompts and task-specific decoding strategies.
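A sketch of the two-stage split: stage one emits a KG-agnostic SPARQL skeleton with placeholders, stage two grounds them in the target KG's vocabulary. The skeleton grammar and the linker output below are hypothetical.

def stage1_structure(question, llm):
    """KG-independent semantic parse: SPARQL keywords plus placeholders,
    e.g. 'SELECT ?x WHERE { [ENT] [REL] ?x }'."""
    return llm(f"Produce a SPARQL skeleton with [ENT]/[REL] placeholders for: {question}")

def stage2_fill(skeleton, question, kg_linker):
    """KG-dependent step: ground placeholders via entity/relation linking."""
    ent, rel = kg_linker(question)
    return skeleton.replace("[ENT]", ent).replace("[REL]", rel)

skeleton = "SELECT ?x WHERE { [ENT] [REL] ?x }"     # as if produced by stage 1
query = stage2_fill(skeleton, "Who directed Alien?",
                    lambda q: ("wd:Q103569", "wdt:P57"))   # hypothetical linker output
print(query)  # SELECT ?x WHERE { wd:Q103569 wdt:P57 ?x }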

Authors:Minsang Kim, Seungjun Baek
Title: Syntriever: How to Train Your Retriever with Synthetic Data from LLMs
Abstract:
LLMs have boosted progress in many AI applications. Recently, there have been attempts to distill the vast knowledge of LLMs into information retrieval systems. Those distillation methods mostly use output probabilities of LLMs which are unavailable in the latest black-box LLMs. We propose Syntriever, a training framework for retrievers using synthetic data from black-box LLMs. Syntriever consists of two stages. Firstly, in the distillation stage, we synthesize relevant and plausibly irrelevant passages and augmented queries using chain-of-thoughts for the given queries. The LLM is asked to self-verify the synthetic data for possible hallucinations, after which retrievers are trained with a loss designed to cluster the embeddings of relevant passages. Secondly, in the alignment stage, we align the retriever with the preferences of LLMs. We propose a preference modeling called partial Plackett-Luce ranking to learn LLM preferences with regularization which prevents the model from deviating excessively from that trained in the distillation stage. Experiments show that Syntriever achieves state-of-the-art performances on benchmark datasets from various domains in nDCG@K. The code is available at https://github.com/kmswin1/Syntriever.
中文: Syntriever是一种创新的训练框架,通过合成数据和包含蒸馏与对齐的两阶段过程,将黑盒大语言模型的知识提炼到检索系统中,并在多个领域的基准数据集上取得了领先性能。
English: Syntriever is a novel training framework that distills knowledge from black-box LLMs into retrieval systems using synthetic data and a two-stage process involving distillation and alignment, achieving state-of-the-art performance across multiple domains.
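A sketch of a partial Plackett-Luce objective over retriever scores: the standard PL log-likelihood applied only to the top-k prefix of an LLM-provided preference ordering. The regularizer that keeps the model near its distillation-stage weights is omitted here.

import torch

def partial_plackett_luce_nll(scores, ranking, k):
    """scores: (n,) retriever scores for n passages; ranking: LLM preference
    order (indices, best first); k: how many top positions to trust."""
    s = scores[torch.as_tensor(ranking)]
    nll = 0.0
    for i in range(k):                       # only the top-k prefix is modeled
        nll = nll - s[i] + torch.logsumexp(s[i:], dim=0)
    return nll

scores = torch.tensor([2.0, 0.5, 1.2, -0.3], requires_grad=True)
loss = partial_plackett_luce_nll(scores, ranking=[0, 2, 1, 3], k=2)
loss.backward()                              # aligns the retriever with LLM preferences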

Authors:Xiaopeng Li, Shanwen Wang, Shasha Li, Shezheng Song, Bin Ji, Jun Ma, Jie Yu
Title: Rethinking the Residual Distribution of Locate-then-Editing Methods in Model Editing
Abstract:
Model editing is a powerful technique for updating the knowledge of Large Language Models (LLMs). Locate-then-edit methods are a popular class of approaches that first identify the critical layers storing knowledge, then compute the residual of the last critical layer based on the edited knowledge, and finally perform multi-layer updates using a least-squares solution by evenly distributing the residual from the first critical layer to the last. Although these methods achieve promising results, they have been shown to degrade the original knowledge of LLMs. We argue that residual distribution leads to this issue. To explore this, we conduct a comprehensive analysis of residual distribution in locate-then-edit methods from both empirical and theoretical perspectives, revealing that residual distribution introduces editing errors, leading to inaccurate edits. To address this issue, we propose the Boundary Layer UpdatE (BLUE) strategy to enhance locate-then-edit methods. Sequential batch editing experiments on three LLMs and two datasets demonstrate that BLUE not only delivers an average performance improvement of 35.59\%, significantly advancing the state of the art in model editing, but also enhances the preservation of LLMs' general capabilities. Our code is available at https://github.com/xpq-tech/BLUE.
中文: 模型编辑更新大语言模型知识,但现有定位后编辑方法因残差分布导致知识退化,而提出的BLUE策略通过边界层更新解决了该问题,实现了35.59%的性能提升并更好地保持了模型通用能力。
English: Model editing updates Large Language Models' knowledge, but current locate-then-edit methods degrade original knowledge due to residual distribution errors, which the proposed BLUE strategy addresses by improving performance by 35.59% while better preserving general capabilities.
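
To make the residual-distribution issue concrete, the schematic below contrasts spreading the residual evenly across all critical layers with a boundary-layer update in the spirit of BLUE. The layer list, equal-share rule, and per-layer residuals are simplifications, not the exact update equations of either method.

```python
import numpy as np

critical_layers = [4, 5, 6, 7, 8]        # layers identified as storing the fact
residual = np.array([1.0, -2.0, 0.5])    # target residual at the last critical layer

# Even distribution: every critical layer absorbs an equal share of the residual,
# which is what the analysis identifies as a source of editing error.
even_updates = {l: residual / len(critical_layers) for l in critical_layers}

# Boundary update: only the first and last critical layers are edited, each with
# a residual computed directly for that layer (full residual assumed here).
blue_updates = {critical_layers[0]: residual.copy(),
                critical_layers[-1]: residual.copy()}

print({l: u.round(2).tolist() for l, u in even_updates.items()})
print({l: u.round(2).tolist() for l, u in blue_updates.items()})
```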

Authors:Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry
Title: Do Large Language Model Benchmarks Test Reliability?
Abstract:
When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities; however, there has been no similar focus on measuring their reliability. To understand the potential ramifications of this gap, we investigate how well current benchmarks quantify model reliability. We find that pervasive label errors can compromise these evaluations, obscuring lingering model failures and hiding unreliable behavior. Motivated by this gap in the evaluation of reliability, we then propose the concept of so-called platinum benchmarks, i.e., benchmarks carefully curated to minimize label errors and ambiguity. As a first attempt at constructing such benchmarks, we revise examples from fifteen existing popular benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks such as elementary-level math word problems. Analyzing these failures further reveals previously unidentified patterns of problems on which frontier models consistently struggle. We provide code at https://github.com/MadryLab/platinum-benchmarks
中文摘要:现有大语言模型基准常存在标签错误而掩盖可靠性问题,为此提出的铂金基准揭示了先进模型在简单任务上仍存在持续性缺陷。
English Summary: Current benchmarks for large language models often contain label errors that obscure reliability issues, prompting the creation of platinum benchmarks which reveal persistent failures in even advanced models on simple tasks.

Authors:Rui Pan, Boyao Wang, Shizhe Diao, Xingyuan Pan, Jipeng Zhang, Renjie Pi, Tong Zhang
Title: Adapt-Pruner: Adaptive Structural Pruning for Efficient Small Language Model Training
Abstract:
Small language models (SLMs) have attracted considerable attention from both academia and industry due to their broad range of applications in edge devices. To obtain SLMs with strong performance, conventional approaches either pre-train the models from scratch, which incurs substantial computational costs, or compress/prune existing large language models (LLMs), which results in performance drops and falls short in comparison to pre-training. In this paper, we investigate the family of acceleration methods that involve both structured pruning and model training. We found that 1) layer-wise adaptive pruning (Adapt-Pruner) is extremely effective in LLMs and yields significant improvements over existing pruning techniques, 2) adaptive pruning equipped with further training leads to models comparable to those pre-trained from scratch, and 3) incremental pruning brings non-trivial performance gains by interleaving pruning with training and only removing a small portion of neurons (~5%) at a time. Experimental results on LLaMA-3.1-8B demonstrate that Adapt-Pruner outperforms conventional pruning methods, such as LLM-Pruner, FLAP, and SliceGPT, by an average of 1%-7% in accuracy on commonsense benchmarks. Additionally, Adapt-Pruner restores the performance of MobileLLM-125M to 600M on the MMLU benchmark with 200× fewer tokens via pruning from its larger counterparts, and discovers a new 1B model that surpasses LLaMA-3.2-1B on multiple benchmarks. The official code is released at https://github.com/research4pan/AdaptPruner.
中文: 自适应剪枝结合增量训练使小型语言模型在显著降低计算成本的同时,性能可媲美预训练模型,并优于传统剪枝方法。
English: Adaptive pruning combined with incremental training enables small language models to achieve performance comparable to pre-trained models while significantly reducing computational costs and outperforming conventional pruning methods.
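
A minimal sketch of incremental structured pruning interleaved with training, removing about 5% of neurons per step as the abstract describes; the mean-absolute-weight importance score and the omitted recovery-training phase are placeholders, not Adapt-Pruner's actual criterion.

```python
import torch
import torch.nn as nn

def prune_linear_rows(layer: nn.Linear, keep_ratio: float) -> nn.Linear:
    """Drop the least important output neurons (rows) of a Linear layer."""
    importance = layer.weight.detach().abs().mean(dim=1)  # one score per row
    n_keep = max(1, int(layer.out_features * keep_ratio))
    keep = importance.topk(n_keep).indices.sort().values
    pruned = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            pruned.bias.copy_(layer.bias[keep])
    return pruned

layer = nn.Linear(256, 256)
for step in range(5):               # interleave: prune ~5%, then train briefly
    layer = prune_linear_rows(layer, keep_ratio=0.95)
    # ... a short recovery-training phase would run here ...
    print(f"step {step}: {layer.out_features} neurons remain")
```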

Authors:Hongli Zhan, Muneeza Azmat, Raya Horesh, Junyi Jessy Li, Mikhail Yurochkin
Title: SPRI: Aligning Large Language Models with Context-Situated Principles
Abstract:
Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at https://github.com/honglizhan/SPRI-public.
中文: 将大型语言模型与人类价值观对齐因依赖人工监督成本高昂而困难重重,但SPRI框架通过为每个查询自动生成实时情境化原则,无需大量人工介入即可提升模型表现与真实性。
English: Aligning Large Language Models with human values is challenging due to the high cost of human oversight, but the SPRI framework addresses this by automatically generating real-time, context-specific principles for each query, enhancing performance and truthfulness without extensive human input.
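
A minimal sketch of SPRI's two-step flow: draft context-situated principles for the query, then condition the response on them. `call_llm`, the prompt wording, and the omission of the paper's critique-and-refine loop are assumptions for illustration.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client."""
    raise NotImplementedError("plug in your LLM client here")

def spri_respond(query: str) -> str:
    # Step 1: derive principles situated in this specific query's context.
    principles = call_llm(
        "Write 3-5 concise principles a careful assistant should follow when "
        f"answering this specific query:\n\n{query}"
    )
    # Step 2: align the response to the generated principles.
    return call_llm(
        f"Principles:\n{principles}\n\nFollowing the principles above, "
        f"answer the query:\n{query}"
    )
```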

Authors:Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue
Title: Demystifying Long Chain-of-Thought Reasoning in LLMs
Abstract:
Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
中文: 本研究通过监督微调和强化学习的系统实验,揭示了增强大语言模型中长思维链推理能力的关键因素,包括奖励塑造和训练计算扩展,并发现基础模型虽具备核心能力但需精细优化策略。
English: This study systematically investigates how to enhance long chain-of-thought reasoning in large language models through supervised fine-tuning and reinforcement learning, identifying key factors like reward shaping and training compute scaling while revealing that core abilities exist in base models but require careful optimization.
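
As a concrete instance of reward shaping for stabilizing CoT length growth, a cosine-style scheme can pay slightly more for shorter correct answers while penalizing longer incorrect answers less, so exploration is not punished. The constants below are illustrative only, not the paper's exact configuration.

```python
import math

def shaped_reward(correct: bool, cot_len: int, max_len: int = 4096) -> float:
    frac = min(cot_len / max_len, 1.0)
    if correct:
        # shorter correct CoTs earn a bit more: reward decays from 1.0 to 0.5
        return 0.5 + 0.5 * math.cos(math.pi * frac / 2)
    # longer incorrect CoTs are punished less: penalty shrinks from -1.0 to -0.5
    return -(0.5 + 0.5 * math.cos(math.pi * frac / 2))
```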

Authors:Mohannad Takrouri, Nicolás M. Cuadrado, Martin Takáč
Title: Knowledge Distillation from Large Language Models for Household Energy Modeling
Abstract:
Machine learning (ML) is increasingly vital for smart-grid research, yet restricted access to realistic, diverse data - often due to privacy concerns - slows progress and fuels doubts within the energy sector about adopting ML-based strategies. We propose integrating Large Language Models (LLMs) in energy modeling to generate realistic, culturally sensitive, and behavior-specific data for household energy usage across diverse geographies. In this study, we employ and compare five different LLMs to systematically produce family structures, weather patterns, and daily consumption profiles for households in six distinct countries. A four-stage methodology synthesizes contextual daily data, including culturally nuanced activities, realistic weather ranges, HVAC operations, and distinct 'energy signatures' that capture unique consumption footprints. Additionally, we explore an alternative strategy where external weather datasets can be directly integrated, bypassing intermediate weather modeling stages while ensuring physically consistent data inputs. The resulting dataset provides insights into how cultural, climatic, and behavioral factors converge to shape carbon emissions, offering a cost-effective avenue for scenario-based energy optimization. This approach underscores how prompt engineering, combined with knowledge distillation, can advance sustainable energy research and climate mitigation efforts. Source code is available at https://github.com/Singularity-AI-Lab/LLM-Energy-Knowledge-Distillation.
中文摘要:本研究提出利用大型语言模型生成真实且具有文化敏感性的家庭能源数据,解决了智能电网研究中数据稀缺的问题,并能够对文化、气候和行为因素共同影响的碳排放进行成本效益分析。
English Summary: This study introduces a method using Large Language Models to generate realistic and culturally sensitive household energy data, addressing data scarcity in smart-grid research and enabling cost-effective analysis of carbon emissions influenced by cultural, climatic, and behavioral factors.

Authors:Seng Pei Liew, Takuya Kato, Sho Takase
Title: Scaling Laws for Upcycling Mixture-of-Experts Language Models
Abstract:
Pretraining large language models (LLMs) is resource-intensive, often requiring months of training time even with high-end GPU clusters. There are two approaches to mitigating such computational demands: reusing smaller models to train larger ones (upcycling), and training computationally efficient models like mixture-of-experts (MoE). In this paper, we study the upcycling of LLMs to MoE models, whose scaling behavior remains underexplored. Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. Particularly, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training dataset that limits the efficiency of upcycling at large computational budgets. Based on these findings, we provide guidance to scale upcycling, and establish conditions under which upcycling outperforms from-scratch training within budget constraints.
中文: 本研究探索了将大型语言模型升级为专家混合架构的扩展规律,通过实证发现性能随数据集规模和模型配置扩展而提升,但识别出密集与升级数据集间的限制性交互作用,最终为预算内高效升级提供了策略指导。
English: This study explores the scaling behavior of upcycling large language models into mixture-of-experts architectures, revealing empirical laws that show performance gains from scaling dataset size and model configuration but identify a limiting interaction between dense and upcycled datasets, ultimately providing guidance for cost-effective upcycling strategies.

Authors:T. Chay-intr, Y. Chen, K. Viriyayudhakorn, T. Theeramunkong
Title: LLaVAC: Fine-tuning LLaVA as a Multimodal Sentiment Classifier
Abstract:
We present LLaVAC, a method for constructing a classifier for multimodal sentiment analysis. This method leverages fine-tuning of the Large Language and Vision Assistant (LLaVA) to predict sentiment labels across both image and text modalities. Our approach involves designing a structured prompt that incorporates both unimodal and multimodal labels to fine-tune LLaVA, enabling it to perform sentiment classification effectively. Experiments on the MVSA-Single dataset demonstrate that LLaVAC outperforms existing methods in multimodal sentiment analysis across three data processing procedures. The implementation of LLaVAC is publicly available at https://github.com/tchayintr/llavac.
中文: LLaVAC方法通过设计结构化提示对LLaVA模型进行微调,在MVSA-Single数据集上的多模态情感分析任务中表现优于现有方法。
English: LLaVAC is a method that fine-tunes the LLaVA model with structured prompts for multimodal sentiment analysis, achieving superior performance on the MVSA-Single dataset compared to existing approaches.

Authors:Linghe Wang, Minhwa Lee, Ross Volkov, Luan Tuyen Chau, Dongyeop Kang
Title: ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
Abstract:
Writing is a cognitively demanding task involving continuous decision-making, heavy use of working memory, and frequent switching between multiple activities. Scholarly writing is particularly complex as it requires authors to coordinate many pieces of multiform knowledge. To fully understand writers' cognitive thought process, one should fully decode the end-to-end writing data (from individual ideas to final manuscript) and understand their complex cognitive mechanisms in scholarly writing. We introduce the ScholaWrite dataset, a first-of-its-kind keystroke corpus of an end-to-end scholarly writing process for complete manuscripts, with thorough annotations of the cognitive writing intentions behind each keystroke. Our dataset includes LaTeX-based keystroke data from five preprints, with nearly 62K total text changes and annotations, across 4 months of paper writing. ScholaWrite shows promising usability and applications (e.g., iterative self-writing), demonstrating the importance of collecting end-to-end writing data, rather than only the final manuscript, for the development of future writing assistants to support the cognitive thinking process of scientists. Our de-identified data examples and code are available on our project page.
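中文: ScholaWrite是首个覆盖端到端学术写作过程的按键级语料库,包含来自五篇预印本、近6.2万次文本修改及其认知写作意图标注,表明过程级写作数据对开发支持科研人员认知过程的写作助手具有重要价值。
English: ScholaWrite is a first-of-its-kind keystroke corpus of the end-to-end scholarly writing process, covering five preprints with nearly 62K annotated text changes, and demonstrates the value of process-level writing data for building assistants that support scientists' cognitive workflow.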

Authors:Bradley P. Allen, Paul T. Groth
Title: A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs
Abstract:
Evaluating large language models (LLMs) for tasks like fact extraction in support of knowledge graph construction frequently involves computing accuracy metrics using a ground truth benchmark based on a knowledge graph (KG). These evaluations assume that errors represent factual disagreements. However, human discourse frequently features metalinguistic disagreement, where agents differ not on facts but on the meaning of the language used to express them. Given the complexity of natural language processing and generation using LLMs, we ask: do metalinguistic disagreements occur between LLMs and KGs? Based on an investigation using the T-REx knowledge alignment dataset, we hypothesize that metalinguistic disagreement does in fact occur between LLMs and KGs, with potential relevance for the practice of knowledge graph engineering. We propose a benchmark for evaluating the detection of factual and metalinguistic disagreements between LLMs and KGs. An initial proof of concept of such a benchmark is available on Github.
中文: 本研究探讨了大型语言模型与知识图谱之间是否存在元语言分歧,并基于T-REx数据集提出了一个检测事实性和元语言分歧的基准。
English: This study investigates whether metalinguistic disagreements occur between large language models and knowledge graphs, proposing a benchmark for detecting both factual and metalinguistic discrepancies based on the T-REx dataset.

Authors:Mayuka Jayawardhana, Renbo, Samuel Dooley, Valeriia Cherepanova, Andrew Gordon Wilson, Frank Hutter, Colin White, Tom Goldstein, Micah Goldblum
Title: Transformers Boost the Performance of Decision Trees on Tabular Data across Sample Sizes
Abstract:
Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. Similarly, TabPFN, a recent non-LLM transformer pretrained on numerous tables for in-context learning, has demonstrated excellent performance for dataset sizes up to a thousand samples. In contrast, gradient-boosted decision trees (GBDTs) are typically trained from scratch on each dataset without benefiting from pretraining data and must learn the relationships between columns from their entries alone since they lack natural language understanding. LLMs and TabPFN excel on small tabular datasets where a strong prior is essential, yet they are not competitive with GBDTs on medium or large datasets, since their context lengths are limited. In this paper, we propose a simple and lightweight approach for fusing large language models and TabPFN with gradient-boosted decision trees, which allows scalable GBDTs to benefit from the natural language capabilities and pretraining of transformers. We name our fusion methods LLM-Boost and PFN-Boost, respectively. While matching or surpassing the performance of the transformer at sufficiently small dataset sizes and GBDTs at sufficiently large sizes, LLM-Boost and PFN-Boost outperform both standalone components on a wide range of dataset sizes in between. We demonstrate state-of-the-art performance against numerous baselines and ensembling algorithms. We find that PFN-Boost achieves the best average performance among all methods we test for all but very small dataset sizes. We release our code at http://github.com/MayukaJ/LLM-Boost .
中文: 大语言模型和TabPFN在小规模表格数据上表现出色,而梯度提升决策树在大规模数据上更优,因此作者提出LLM-Boost和PFN-Boost融合方法,结合两者优势在不同规模数据集上均实现最优性能。
English: Large language models and TabPFN excel on small tabular datasets but are outperformed by gradient-boosted decision trees on larger ones, so the authors propose LLM-Boost and PFN-Boost fusion methods that combine their strengths to achieve state-of-the-art performance across various dataset sizes.
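
One plausible way to realize such a fusion, sketched under the assumption (not necessarily the paper's exact scheme) that the transformer's zero-shot probabilities are converted to log-odds and passed to XGBoost as a base margin, so the trees learn only the residual signal:

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Stand-in for TabPFN / LLM zero-shot class probabilities on the same rows.
prior_proba = 1 / (1 + np.exp(-0.8 * X[:, 0]))           # hypothetical prior
prior_margin = np.log(prior_proba / (1 - prior_proba))   # convert to log-odds

dtrain = xgb.DMatrix(X, label=y, base_margin=prior_margin)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 3}, dtrain, 50)

dtest = xgb.DMatrix(X, base_margin=prior_margin)         # prior applied at test too
print("fused accuracy:", ((booster.predict(dtest) > 0.5) == y).mean())
```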

Authors:Yan Li, Tianyi Zhang, Zechuan Li, Soyeon Caren Han
Title: A Training-Free Length Extrapolation Approach for LLMs: Greedy Attention Logit Interpolation (GALI)
Abstract:
Transformer-based Large Language Models (LLMs) struggle with inputs exceeding their training context window due to positional out-of-distribution (O.O.D.) issues that disrupt attention. Existing solutions, including fine-tuning and training-free methods, face challenges like inefficiency, redundant interpolation, logit outliers, or loss of local positional information. We propose Greedy Attention Logit Interpolation (GALI), a training-free method that improves length extrapolation by greedily reusing pretrained positional intervals and interpolating attention logits to eliminate outliers. GALI achieves stable and superior performance across a wide range of long-context tasks without requiring input-length-specific tuning. Our analysis further reveals that LLMs interpret positional intervals unevenly and that restricting interpolation to narrower ranges improves performance, even on short-context tasks. GALI represents a step toward more robust and generalizable long-text processing in LLMs. Our implementation of GALI, along with the experiments from our paper, is open-sourced at https://github.com/adlnlp/Gali.
中文: GALI是一种无需训练的方法,通过重用位置区间和插值注意力对数来提升大语言模型的长度外推能力,无需特定调优即可在长文本任务中实现稳定优越的性能。
English: GALI is a training-free method that enhances length extrapolation in LLMs by reusing positional intervals and interpolating attention logits, achieving stable performance across long-context tasks without specific tuning.

Authors:Alex Flückiger, Chantal Amrhein, Tim Graf, Frédéric Odermatt, Martin Pömsl, Philippe Schläpfer, Florian Schottmann, Samuel Läubli
Title: A comparison of translation performance between DeepL and Supertext
Abstract:
As strong machine translation (MT) systems are increasingly based on large language models (LLMs), reliable quality benchmarking requires methods that capture their ability to leverage extended context. This study compares two commercial MT systems -- DeepL and Supertext -- by assessing their performance on unsegmented texts. We evaluate translation quality across four language directions with professional translators assessing segments with full document-level context. While segment-level assessments indicate no strong preference between the systems in most cases, document-level analysis reveals a preference for Supertext in three out of four language directions, suggesting superior consistency across longer texts. We advocate for more context-sensitive evaluation methodologies to ensure that MT quality assessments reflect real-world usability. We release all evaluation data and scripts for further analysis and reproduction at https://github.com/supertext/evaluation_deepl_supertext.
中文: 本研究通过上下文感知评估比较DeepL和Supertext机器翻译系统,发现Supertext在长文本中表现更稳定,并倡导采用更多文档级评估方法。
English: This study compares DeepL and Supertext machine translation systems using context-aware evaluation, revealing Supertext's superior consistency in longer texts and advocating for more document-level assessment methods.

Authors:Abdelrahman Abdallah, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt
Title: Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation
Abstract:
Retrieval, re-ranking, and retrieval-augmented generation (RAG) are critical components of modern applications in information retrieval, question answering, or knowledge-based text generation. However, existing solutions are often fragmented, lacking a unified framework that easily integrates these essential processes. The absence of a standardized implementation, coupled with the complexity of retrieval and re-ranking workflows, makes it challenging for researchers to compare and evaluate different approaches in a consistent environment. While existing toolkits such as Rerankers and RankLLM provide general-purpose reranking pipelines, they often lack the flexibility required for fine-grained experimentation and benchmarking. In response to these challenges, we introduce Rankify, a powerful and modular open-source toolkit designed to unify retrieval, re-ranking, and RAG within a cohesive framework. Rankify supports a wide range of retrieval techniques, including dense and sparse retrievers, while incorporating state-of-the-art re-ranking models to enhance retrieval quality. Additionally, Rankify includes a collection of pre-retrieved datasets to facilitate benchmarking, available at Huggingface (https://huggingface.co/datasets/abdoelsayed/reranking-datasets-light). To encourage adoption and ease of integration, we provide comprehensive documentation (http://rankify.readthedocs.io/), an open-source implementation on GitHub (https://github.com/DataScienceUIBK/rankify), and a PyPI package for easy installation (https://pypi.org/project/rankify/). As a unified and lightweight framework, Rankify allows researchers and practitioners to advance retrieval and re-ranking methodologies while ensuring consistency, scalability, and ease of use.
中文: Rankify是一个模块化的开源工具包,将检索、重排序和检索增强生成统一在一个集成框架中,为研究人员和从业者提供全面的工具和数据集,以促进一致的实验和基准测试。
English: Rankify is a modular open-source toolkit that unifies retrieval, re-ranking, and retrieval-augmented generation in a cohesive framework, offering comprehensive tools and datasets to facilitate consistent experimentation and benchmarking for researchers and practitioners.
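
For orientation, the three stages Rankify standardizes compose as in the dependency-free sketch below. This is not Rankify's actual API; `rerank` and `generate` are hypothetical callables.

```python
def overlap(q: str, d: str) -> int:
    """Toy lexical score standing in for a sparse retriever."""
    return len(set(q.lower().split()) & set(d.lower().split()))

def retrieve(query, corpus, k=10):
    return sorted(corpus, key=lambda d: -overlap(query, d))[:k]   # stage 1

def rag_answer(query, corpus, rerank, generate, k=10, top=3):
    candidates = retrieve(query, corpus, k)                       # retrieval
    ranked = sorted(candidates, key=lambda d: -rerank(query, d))  # re-ranking
    return generate(query, ranked[:top])                          # grounded generation
```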

Authors:Qianhao Yuan, Yanjiang Liu, Yaojie Lu, Hongyu Lin, Ben He, Xianpei Han, Le Sun
Title: SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency
Abstract:
Multimodal Large Language Models (MLLMs) mainly fall into two architectures, each involving a trade-off between training and inference efficiency: embedding space alignment (e.g., LLaVA-1.5) is inefficient during inference, while cross-attention space alignment (e.g., Flamingo) is inefficient in training. In this paper, we compare these two architectures and identify the key factors for building efficient MLLMs. A primary difference between them lies in how attention is applied to visual tokens, particularly in their interactions with each other. To investigate whether attention among visual tokens is necessary, we propose a new self-attention mechanism, NAAViT (No Attention Among Visual Tokens), which eliminates this type of attention. Our pilot experiment on LLaVA-1.5 shows that attention among visual tokens is highly redundant. Based on these insights, we introduce SAISA (Self-Attention Input Space Alignment), a novel architecture that enhances both training and inference efficiency. SAISA directly aligns visual features with the input spaces of NAAViT self-attention blocks, reducing computational overhead in both self-attention blocks and feed-forward networks (FFNs). Using the same configuration as LLaVA-1.5, SAISA reduces inference FLOPs by 66% and the training budget by 26%, while achieving superior performance in terms of accuracy. Comprehensive ablation studies further validate the effectiveness of SAISA across various LLMs and visual encoders. The code and model will be publicly available at https://github.com/icip-cas/SAISA.
中文: 本文提出SAISA这一新型多模态大语言模型架构,通过消除视觉标记间的注意力机制,在提升模型精度的同时显著降低了训练与推理的计算成本。
English: This paper introduces SAISA, a novel multimodal large language model architecture that eliminates attention among visual tokens to significantly enhance both training and inference efficiency while achieving superior accuracy compared to existing models.
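
The NAAViT idea can be expressed as an attention mask that removes visual-to-visual interactions while text tokens still attend to visual ones. Whether visual tokens keep their diagonal (self) entries is an assumption in this sketch.

```python
import torch

def naavit_mask(is_visual: torch.Tensor) -> torch.Tensor:
    """is_visual: (seq_len,) bool. Returns a (seq_len, seq_len) bool mask where
    True = attention allowed, combined with a standard causal mask."""
    n = is_visual.numel()
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    among_visual = is_visual[:, None] & is_visual[None, :]  # query and key both visual
    keep_diag = torch.eye(n, dtype=torch.bool)              # let tokens see themselves
    return causal & (~among_visual | keep_diag)

# 4 visual prefix tokens followed by 4 text tokens
mask = naavit_mask(torch.tensor([1, 1, 1, 1, 0, 0, 0, 0], dtype=torch.bool))
print(mask.int())
```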

Authors:Ibrahim Bouabdallaoui, Fatima Guerouate, Samya Bouhaddour, Chaimae Saadi, Mohammed Sbihi
Title: FewTopNER: Integrating Few-Shot Learning with Topic Modeling and Named Entity Recognition in a Multilingual Framework
Abstract:
We introduce FewTopNER, a novel framework that integrates few-shot named entity recognition (NER) with topic-aware contextual modeling to address the challenges of cross-lingual and low-resource scenarios. FewTopNER leverages a shared multilingual encoder based on XLM-RoBERTa, augmented with language-specific calibration mechanisms, to generate robust contextual embeddings. The architecture comprises a prototype-based entity recognition branch, employing BiLSTM and Conditional Random Fields for sequence labeling, and a topic modeling branch that extracts document-level semantic features through hybrid probabilistic and neural methods. A cross-task bridge facilitates dynamic bidirectional attention and feature fusion between entity and topic representations, thereby enhancing entity disambiguation by incorporating global semantic context. Empirical evaluations on multilingual benchmarks across English, French, Spanish, German, and Italian demonstrate that FewTopNER significantly outperforms existing state-of-the-art few-shot NER models. In particular, the framework achieves improvements of 2.5-4.0 percentage points in F1 score and exhibits enhanced topic coherence, as measured by normalized pointwise mutual information. Ablation studies further confirm the critical contributions of the shared encoder and cross-task integration mechanisms to the overall performance. These results underscore the efficacy of incorporating topic-aware context into few-shot NER and highlight the potential of FewTopNER for robust cross-lingual applications in low-resource settings.
中文摘要:FewTopNER是一种创新框架,通过融合主题感知上下文建模和跨任务特征交互,显著提升了低资源场景下多语言小样本命名实体识别的性能表现。
English Summary: FewTopNER is a novel framework that enhances few-shot named entity recognition by integrating topic-aware contextual modeling and cross-task feature fusion, achieving superior performance across multiple languages in low-resource scenarios.

Authors:Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu
Title: STAIR: Improving Safety Alignment with Introspective Reasoning
Abstract:
Ensuring the safety and harmlessness of Large Language Models (LLMs) has become equally critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and the susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves a safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at https://github.com/thu-ml/STAIR.
中文摘要:STAIR是一种新颖框架,通过结合自省推理和迭代优化来增强大语言模型的安全性,相比传统对齐方法,在减少有害输出的同时更好地保持了实用性。
English Summary: STAIR is a novel framework that enhances LLM safety by integrating introspective reasoning and iterative optimization, effectively reducing harmful outputs while maintaining helpfulness compared to traditional alignment methods.

Authors:Daniel Tamayo, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas
Title: Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge
Abstract:
Recent research has explored methods for updating and modifying factual knowledge in large language models, often focusing on specific multi-layer perceptron blocks. This study expands on this work by examining the effectiveness of existing knowledge editing methods across languages and delving into the role of attention mechanisms in this process. Drawing from the insights gained, we propose Mass-Editing Memory with Attention in Transformers (MEMAT), a method that achieves significant improvements in all metrics while requiring minimal parameter modifications. MEMAT delivers a remarkable 10% increase in magnitude metrics, benefits languages not included in the training data and also demonstrates a high degree of portability. Our code and data are at https://github.com/dtamayo-nlp/MEMAT.
中文摘要:本研究提出MEMAT新方法,通过利用注意力机制显著提升大语言模型的知识编辑能力,在仅需少量参数修改的情况下实现指标10%的提升,并能惠及未训练语言。
English Summary: This study introduces MEMAT, a novel method that significantly enhances knowledge editing in large language models by leveraging attention mechanisms, achieving a 10% improvement in metrics while requiring minimal parameter changes and benefiting untrained languages.

Authors:Yuan Gao, Mattia Piccinini, Korbinian Moller, Amr Alanwar, Johannes Betz
Title: From Words to Collisions: LLM-Guided Evaluation and Adversarial Generation of Safety-Critical Driving Scenarios
Abstract:
Ensuring the safety of autonomous vehicles requires virtual scenario-based testing, which depends on the robust evaluation and generation of safety-critical scenarios. So far, researchers have used scenario-based testing frameworks that rely heavily on handcrafted scenarios as safety metrics. To reduce the effort of human interpretation and overcome the limited scalability of these approaches, we combine Large Language Models (LLMs) with structured scenario parsing and prompt engineering to automatically evaluate and generate safety-critical driving scenarios. We introduce Cartesian and Ego-centric prompt strategies for scenario evaluation, and an adversarial generation module that modifies trajectories of risk-inducing vehicles (ego-attackers) to create critical scenarios. We validate our approach using a 2D simulation framework and multiple pre-trained LLMs. The results show that the evaluation module effectively detects collision scenarios and infers scenario safety. Meanwhile, the new generation module identifies high-risk agents and synthesizes realistic, safety-critical scenarios. We conclude that an LLM equipped with domain-informed prompting techniques can effectively evaluate and generate safety-critical driving scenarios, reducing dependence on handcrafted metrics. We release our open-source code and scenarios at: https://github.com/TUM-AVS/From-Words-to-Collisions.
中文摘要:本研究将大型语言模型与结构化场景解析及提示工程相结合,自动评估和生成安全关键驾驶场景,有效降低对人工方法的依赖,并在碰撞检测和真实场景合成方面展现出卓越性能。
English Summary: This study integrates Large Language Models with structured parsing and prompt engineering to automatically evaluate and generate safety-critical driving scenarios, effectively reducing reliance on manual methods while demonstrating high performance in collision detection and realistic scenario synthesis.

Authors:Shivam Singh, Karthik Swaminathan, Nabanita Dash, Ramandeep Singh, Snehasis Banerjee, Mohan Sridharan, Madhava Krishna
Title: AdaptBot: Combining LLM with Knowledge Graphs and Human Input for Generic-to-Specific Task Decomposition and Knowledge Refinement
Abstract:
An embodied agent assisting humans is often asked to complete new tasks, and there may not be sufficient time or labeled examples to train the agent to perform these new tasks. Large Language Models (LLMs) trained on considerable knowledge across many domains can be used to predict a sequence of abstract actions for completing such tasks, although the agent may not be able to execute this sequence due to task-, agent-, or domain-specific constraints. Our framework addresses these challenges by leveraging the generic predictions provided by the LLM and the prior domain knowledge encoded in a Knowledge Graph (KG), enabling an agent to quickly adapt to new tasks. The robot also solicits and uses human input as needed to refine its existing knowledge. Based on experimental evaluation in the context of cooking and cleaning tasks in simulation domains, we demonstrate that the interplay between LLM, KG, and human input leads to substantial performance gains compared with just using the LLM. Project website: https://sssshivvvv.github.io/adaptbot/
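中文摘要:AdaptBot框架结合大语言模型的通用动作预测、知识图谱中的领域先验知识以及按需征询的人类输入,使具身智能体能够快速适应模拟环境中的烹饪和清洁等新任务,性能显著优于仅使用大语言模型的方案。
English Summary: AdaptBot combines generic action predictions from Large Language Models with domain knowledge encoded in a Knowledge Graph and human input solicited as needed, enabling embodied agents to adapt quickly to new cooking and cleaning tasks in simulation and substantially outperform an LLM-only baseline.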

Authors:Wenhao Zheng, Yixiao Chen, Weitong Zhang, Souvik Kundu, Yun Li, Zhengzhong Liu, Eric P. Xing, Hongyi Wang, Huaxiu Yao
Title: CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
Abstract:
Large language models have achieved remarkable success in various tasks but suffer from high computational costs during inference, limiting their deployment in resource-constrained applications. To address this issue, we propose a novel Collaborative Inference with Token-lEvel Routing (CITER) framework that enables efficient collaboration between small and large language models (SLMs & LLMs) through a token-level routing strategy. Specifically, CITER routes non-critical tokens to an SLM for efficiency and routes critical tokens to an LLM for generalization quality. We formulate router training as a policy optimization, where the router receives rewards based on both the quality of predictions and the inference costs of generation. This allows the router to learn to predict token-level routing scores and make routing decisions based on both the current token and the future impact of its decisions. To further accelerate the reward evaluation process, we introduce a shortcut which significantly reduces the cost of reward estimation and improves the practicality of our approach. Extensive experiments on five benchmark datasets demonstrate that CITER reduces the inference costs while preserving high-quality generation, offering a promising solution for real-time and resource-constrained applications. Our data and code are available at https://github.com/aiming-lab/CITER.
中文摘要:CITER框架通过令牌级路由策略,将非关键令牌分配给小型语言模型以提高效率,关键令牌分配给大型模型保证质量,在降低推理成本的同时保持生成内容的高水准。
English Summary: The CITER framework optimizes inference efficiency by routing non-critical tokens to a small language model for speed and critical tokens to a large model for accuracy, balancing cost and quality through token-level decisions.
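
A minimal sketch of the token-level routing loop; the `router` here is a hypothetical scoring callable, whereas the paper trains it with policy optimization over both prediction quality and inference cost.

```python
def generate_with_routing(prompt, slm, llm, router, threshold=0.5, max_new=64):
    """slm/llm: callables mapping a token list to the next token;
    router: callable returning a criticality score for the next position."""
    tokens = list(prompt)
    for _ in range(max_new):
        if router(tokens) < threshold:   # non-critical token -> cheap small model
            nxt = slm(tokens)
        else:                            # critical token -> large model
            nxt = llm(tokens)
        tokens.append(nxt)
        if nxt == "<eos>":
            break
    return tokens
```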

Authors:Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu
Title: Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning
Abstract:
Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant, uninformative, or even harmful. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning.
Chinese: 近期研究表明,在大语言模型的监督微调中,数据质量比数量更重要,本文提出了一种基于噪声标签视角的通用令牌清洗流程,通过过滤非信息性令牌来提升下游任务性能。
English: Recent research reveals that in supervised fine-tuning of large language models, token-level data quality is more critical than quantity, leading to the development of a token cleaning pipeline that filters uninformative tokens to enhance downstream task performance.
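
A minimal sketch of the fixed-reference-model variant: score each target token by the loss gap between the current model and a reference model, then mask low-influence tokens out of the SFT loss. The sign convention and zero threshold are simplifying assumptions, not the paper's exact influence definition.

```python
import torch
import torch.nn.functional as F

def clean_token_mask(policy_logits, ref_logits, labels, threshold=0.0):
    """All logits: (seq, vocab); labels: (seq,). Keep tokens on which the
    current model still lags the reference, used here as a proxy for
    informative, task-relevant tokens."""
    policy_nll = F.cross_entropy(policy_logits, labels, reduction="none")
    ref_nll = F.cross_entropy(ref_logits, labels, reduction="none")
    return (policy_nll - ref_nll) > threshold

def masked_sft_loss(policy_logits, ref_logits, labels):
    keep = clean_token_mask(policy_logits, ref_logits, labels).float()
    nll = F.cross_entropy(policy_logits, labels, reduction="none")
    return (nll * keep).sum() / keep.sum().clamp(min=1.0)
```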

Authors:Angelina Wang, Michelle Phan, Daniel E. Ho, Sanmi Koyejo
Title: Fairness through Difference Awareness: Measuring Desired Group Discrimination in LLMs
Abstract:
Algorithmic fairness has conventionally adopted the mathematically convenient perspective of racial color-blindness (i.e., difference-unaware treatment). However, we contend that in a range of important settings, group difference awareness matters. For example, differentiating between groups may be necessary in legal contexts (e.g., the U.S. compulsory draft applies to men but not women) and harm assessments (e.g., referring to girls as "terrorists" may be less harmful than referring to Muslim people as such). Thus, in contrast to most fairness work, we study fairness through the perspective of treating people differently, when it is contextually appropriate to do so. We first introduce an important distinction between descriptive (fact-based), normative (value-based), and correlation (association-based) benchmarks. This distinction is significant because each category requires separate interpretation and mitigation tailored to its specific characteristics. Then, we present a benchmark suite composed of eight different scenarios for a total of 16k questions that enables us to assess difference awareness. Finally, we show results across ten models that demonstrate difference awareness is a distinct dimension of fairness where existing bias mitigation strategies may backfire.
中文摘要:该研究主张算法公平性应关注群体差异而非采用色盲方法,通过引入新基准和广泛测试表明,在这一独特的公平维度上,现有的偏见缓解策略可能适得其反。
English Summary: The study argues that algorithmic fairness should incorporate group difference awareness rather than color-blind approaches, introducing new benchmarks and demonstrating through extensive testing that existing bias mitigation methods can be counterproductive in this distinct dimension of fairness.

Authors:Avery Ma, Yangchen Pan, Amir-massoud Farahmand
Title: PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Abstract:
Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Chinese: 本文提出PANDAS混合技术,通过整合积极肯定、负面演示和自适应采样来增强多轮越狱攻击,并在长上下文场景中验证其优于基线方法的性能。
English: The paper introduces PANDAS, a hybrid technique that enhances many-shot jailbreaking by incorporating positive affirmations, negative demonstrations, and adaptive sampling, and demonstrates its superior performance over baseline methods in long-context scenarios.

Authors:Bo Pang, Tingrui Qiao, Caroline Walker, Chris Cunningham, Yun Sing Koh
Title: LIBRA: Measuring Bias of Large Language Model from a Local Context
Abstract:
Large Language Models (LLMs) have significantly advanced natural language processing applications, yet their widespread use raises concerns regarding inherent biases that may reduce utility for, or cause harm to, particular social groups. Despite the advancement in addressing LLM bias, existing research has two major limitations. First, existing LLM bias evaluation focuses on the U.S. cultural context, making it challenging to reveal stereotypical biases of LLMs toward other cultures, leading to unfair development and use of LLMs. Second, current bias evaluation often assumes models are familiar with the target social groups. When LLMs encounter words beyond their knowledge boundaries that are unfamiliar in their training data, they produce irrelevant results in the local context due to hallucinations and overconfidence, which are not necessarily indicative of inherent bias. This research addresses these limitations with a Local Integrated Bias Recognition and Assessment Framework (LIBRA) for measuring bias using datasets sourced from local corpora without crowdsourcing. Implementing this framework, we develop a dataset comprising over 360,000 test cases in the New Zealand context. Furthermore, we propose the Enhanced Idealized CAT Score (EiCAT), integrating the iCAT score with a beyond knowledge boundary score (bbs) and a distribution divergence-based bias measurement to tackle the challenge of LLMs encountering words beyond knowledge boundaries. Our results show that the BERT family, GPT-2, and Llama-3 models seldom understand local words in different contexts. While Llama-3 exhibits larger bias, it responds better to different cultural contexts. The code and dataset are available at: https://github.com/ipangbo/LIBRA.
中文摘要:本研究提出LIBRA框架,利用本地数据集评估大语言模型的文化偏见,通过增强理想化CAT分数解决现有评估方法的局限,涵盖文化多样性并处理模型对陌生词汇的响应问题。
English Summary: This research introduces the LIBRA framework to evaluate cultural biases in Large Language Models using local datasets, addressing limitations in current bias assessments by incorporating cultural diversity and handling unfamiliar terms through the Enhanced Idealized CAT Score.

Authors:Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, Xu Yang
Title: Fast Large Language Model Collaborative Decoding via Speculation
Abstract:
Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding without compromising performance. Inspired by Speculative Decoding, where a small proposal model generates tokens sequentially and a larger target model verifies them in parallel, our approach builds on two key insights: (1) the verification distribution can be the combined distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to collaboration among n models and theoretically prove that CoS is never slower than standard collaborative decoding, typically achieving faster speed. Extensive experiments demonstrate CoS is 1.11x-2.23x faster than standard collaborative decoding without compromising generation quality. Our code is available at https://github.com/Kamichanw/CoS/.
Chinese: CoS是一种新颖的协作解码框架,通过推测验证和交替模型角色,在不牺牲性能的情况下将解码速度提升最高达2.23倍。
English: CoS is a novel framework that accelerates collaborative decoding by up to 2.23 times without sacrificing performance, using speculative verification and alternating model roles to enhance efficiency.
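
The first insight, verifying against the combined distribution, drops into the standard speculative accept/reject step almost unchanged; the 50/50 mixture weight below is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def verify_step(p_prop, p_tgt, proposed):
    """p_prop, p_tgt: (vocab,) next-token distributions; proposed: token id
    sampled from p_prop. Returns a token distributed as the combined model."""
    p_comb = 0.5 * p_prop + 0.5 * p_tgt            # combined verification dist.
    if rng.random() < min(1.0, p_comb[proposed] / p_prop[proposed]):
        return proposed                             # accept the proposed token
    residual = np.maximum(p_comb - p_prop, 0)       # else resample from residual
    return rng.choice(len(p_comb), p=residual / residual.sum())

p_prop = np.array([0.6, 0.3, 0.1])
p_tgt = np.array([0.2, 0.5, 0.3])
print(verify_step(p_prop, p_tgt, proposed=0))
```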

Authors:Muhammad Zain Raza, Jiawei Xu, Terence Lim, Lily Boddy, Carlos M. Mery, Andrew Well, Ying Ding
Title: LLM-TA: An LLM-Enhanced Thematic Analysis Pipeline for Transcripts from Parents of Children with Congenital Heart Disease
Abstract:
Thematic Analysis (TA) is a fundamental method in healthcare research for analyzing transcript data, but it is resource-intensive and difficult to scale for large, complex datasets. This study investigates the potential of large language models (LLMs) to augment the inductive TA process in high-stakes healthcare settings. Focusing on interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we propose an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline. Our pipeline integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze nine detailed transcripts following the inductive TA framework. We evaluate the LLM-generated themes against human-generated results using thematic similarity metrics, LLM-assisted assessments, and expert reviews. Results demonstrate that our pipeline outperforms existing LLM-assisted TA methods significantly. While the pipeline alone has not yet reached human-level quality in inductive TA, it shows great potential to improve scalability, efficiency, and accuracy while reducing analyst workload when working collaboratively with domain experts. We provide practical recommendations for incorporating LLMs into high-stakes TA workflows and emphasize the importance of close collaboration with domain experts to address challenges related to real-world applicability and dataset complexity. https://github.com/jiaweixu98/LLM-TA
中文: 本研究提出了一种LLM增强主题分析(LLM-TA)流程,通过整合先进语言模型与专家知识显著提升了医疗研究的可扩展性和效率,尽管在归纳性主题分析中尚未完全达到人类专家的水平。
English: This study introduces an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline that significantly improves scalability and efficiency in healthcare research by integrating advanced language models with human expertise, though it has not yet achieved full human-level quality in inductive thematic analysis.

Authors:Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
Title: Learning to Generate Unit Tests for Automated Debugging
Abstract:
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we propose UTDebug that (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting, and helps LLMs debug effectively. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3.17% and 12.35% (respectively) over other LLM-based UT generation baselines. Moreover, we observe that feedback from Qwen2.5 32B-based UTGen model can enhance debugging with frontier LLMs like GPT-4o by 13.8%. Lastly, we demonstrate that UTGen is a better judge for code correctness, outperforming a state-of-the-art trained 8B reward model by 4.43% on HumanEval+ with best-of-10 sampling using Qwen2.5 7B.
中文:UTGen教导大语言模型生成能揭示代码错误的单元测试输入及正确预期输出,而UTDebug通过测试时计算和回溯验证来增强该过程,从而显著提升调试效果与代码纠错准确率。
English: UTGen enables LLMs to generate unit test inputs that reveal code errors and predict correct outputs, while UTDebug enhances this process through test-time computation and validation to improve debugging effectiveness and accuracy.
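
A minimal sketch of the validate-and-backtrack loop: keep an edit only if it raises the pass rate on the generated unit tests, otherwise fall back to the previous best. `propose_fix` and `gen_tests` are hypothetical stand-ins for the LLM calls.

```python
def pass_rate(code, tests):
    """code: callable under debug; tests: list of (input, expected_output)."""
    ok = 0
    for inp, expected in tests:
        try:
            ok += code(inp) == expected
        except Exception:
            pass
    return ok / max(len(tests), 1)

def debug_loop(code, propose_fix, gen_tests, rounds=3):
    tests = gen_tests(code)                 # error-revealing inputs + outputs
    best, best_rate = code, pass_rate(code, tests)
    for _ in range(rounds):
        candidate = propose_fix(best, tests)
        rate = pass_rate(candidate, tests)
        if rate > best_rate:                # keep the edit only if it helps;
            best, best_rate = candidate, rate
        # otherwise backtrack: `best` is left unchanged
    return best
```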

Authors:Yanbo Wang, Zixiang Xu, Yue Huang, Chujie Gao, Siyuan Wu, Jiayi Ye, Pin-Yu Chen, Xiuying Chen, Xiangliang Zhang
Title: Adaptive Distraction: Probing LLM Contextual Robustness with Automated Tree Search
Abstract:
Large Language Models (LLMs) often struggle to maintain their original performance when faced with semantically coherent but task-irrelevant contextual information. Although prior studies have explored this issue using fixed-template or retrieval-based distractions, such static methods show limited effectiveness against contemporary models. To address this problem, we propose a dynamic distraction generation framework based on tree search, where the generation process is guided by model behavior. Without modifying the original question or answer, the method efficiently produces challenging adaptive distractions across multiple datasets, enabling systematic stress testing of LLMs' contextual robustness. Experiments on four benchmarks demonstrate that the generated distractions lead to an average performance drop of over 45% for mainstream models. Further comparisons of mitigation strategies show that prompt-based optimization methods yield limited gains, whereas post-training approaches (e.g., DPO) significantly enhance the model's contextual robustness. The results indicate that these issues do not stem from knowledge deficits in LLMs, but from a fundamental inability to maintain consistent reasoning under contextual distraction, posing a major challenge to the reliability of LLMs in real-world applications. The code is publicly available at https://github.com/wyf23187/Adaptive_Distractions.
中文摘要: 本文提出了一种基于树搜索的动态干扰生成框架,通过产生自适应干扰项系统性测试大语言模型,发现其在上下文干扰下会出现严重性能下降,这源于模型推理一致性的根本缺陷而非知识储备问题。
English Summary: This paper introduces a dynamic tree search-based framework that generates adaptive distractions to test and reveal large language models' significant performance drops under contextual interference, highlighting their fundamental reasoning consistency issues rather than knowledge gaps.
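
A minimal sketch of behavior-guided search for distractions: expand candidate contexts and keep those that most reduce the model's confidence in the gold answer. `mutate` and `gold_confidence` are hypothetical stand-ins, and the beam-style pruning is a simplification of the paper's tree search.

```python
import heapq

def search_distraction(question, seeds, mutate, gold_confidence, depth=3, beam=4):
    """Grow distractor contexts, keeping the `beam` candidates that minimize
    the model's confidence in the gold answer at each depth."""
    frontier = [(gold_confidence(question, d), d) for d in seeds]
    for _ in range(depth):
        children = [(gold_confidence(question, m), m)
                    for _, d in frontier for m in mutate(d)]
        frontier = heapq.nsmallest(beam, frontier + children, key=lambda t: t[0])
    return min(frontier, key=lambda t: t[0])[1]   # hardest distraction found
```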

Authors:Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang
Title: Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding
Abstract:
Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K), while no such patterns appear in values (V), across various modern transformer-based LLMs (Q, K, and V denote the representations output by the query, key, and value layers, respectively). Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model's parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, aligning with our analysis. Finally, we trace the emergence of concentrated massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE), which has appeared since the first layers. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://github.com/MingyuJ666/Rope_with_LLM.
中文: 大语言模型中注意力查询和键因旋转位置编码出现集中大值,这些值对上下文知识解释至关重要,而非参数知识检索。
English: Large language models exhibit concentrated massive values in attention queries and keys due to Rotary Positional Encoding, which are crucial for contextual knowledge interpretation rather than parametric knowledge retrieval.
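
To see where RoPE acts, the sketch below applies a standard rotary embedding to random queries and then checks which feature dimensions carry the largest average magnitudes; random tensors are used purely for illustration, whereas the paper analyzes trained LLMs.

```python
import torch

def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq, dim) with even dim; rotates consecutive feature pairs by position."""
    seq, dim = x.shape
    inv_freq = 1.0 / base ** (torch.arange(0, dim, 2).float() / dim)
    angles = torch.arange(seq).float()[:, None] * inv_freq[None, :]  # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q_rot = rope(torch.randn(128, 64))
# per-dimension mean |value|: concentration shows up as a few dominant columns
print(q_rot.abs().mean(dim=0).topk(5))
```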

Authors:Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, Huan Liu
Title: Preference Leakage: A Contamination Problem in LLM-as-a-judge
Abstract:
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods in model development. While their combination significantly enhances the efficiency of model training and evaluation, little attention has been given to the potential contamination brought by this new model development paradigm. In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators. To study this issue, we first define three common relatednesses between the data generator LLM and the judge LLM: being the same model, having an inheritance relationship, and belonging to the same model family. Through extensive experiments, we empirically confirm the bias of judges towards their related student models caused by preference leakage across multiple LLM baselines and benchmarks. Further analysis suggests that preference leakage is a pervasive and real-world problem that is harder to detect compared to previously identified biases in LLM-as-a-judge scenarios. All of these findings imply that preference leakage is a widespread and challenging problem in the area of LLM-as-a-judge. We release all codes and data at: https://github.com/David-Li0406/Preference-Leakage.
中文: 本研究揭示了在LLM作为评判者的场景中,偏好泄露这一污染问题,即评估者对来自相关模型的合成数据表现出偏向性,表明这是模型开发中普遍存在且难以检测的难题。
English: This study identifies preference leakage as a contamination issue in LLM-as-a-judge scenarios, where evaluators show bias toward synthetically generated data from related models, revealing it as a pervasive and hard-to-detect problem in model development.

Authors:Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Lei Li
Title: Explaining Context Length Scaling and Bounds for Language Models
Abstract:
Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context could harm performance, while some experimentally summarize loss reduction by relevant long context as Scaling Laws. This calls for a more thorough understanding on how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models. Code for our experiments is available at: https://github.com/JingzheShi/NLPCtlScalingAndBounds.
中文: 本研究从内在空间视角提出了一个理论框架,解释上下文长度对语言建模的影响,并通过自然语言和合成数据实验验证,发现训练数据集大小决定了最优上下文长度并设定了扩展界限。
English: This study introduces a theoretical framework from an Intrinsic Space perspective to explain how context length affects Language Modeling, validated through experiments on natural and synthetic data, revealing that training dataset size determines optimal context length and sets scaling bounds.

Authors:Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
Title: Process Reinforcement through Implicit Rewards
Abstract:
Dense process rewards have proven a more effective alternative to the sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. While dense rewards also offer an appealing choice for the reinforcement learning (RL) of LLMs since their fine-grained rewards have the potential to address some inherent issues of outcome rewards, such as training efficiency and credit assignment, this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement across several key reasoning benchmarks over the SFT model. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks with 10% of its training data.
Chinese: PRIME通过隐式过程奖励,仅利用策略推演和结果标签实现在线过程奖励模型更新,在数学和编程任务中无需专门奖励模型训练即取得显著性能提升。
English: PRIME introduces implicit process rewards to enable online updates of process reward models using only policy rollouts and outcome labels, achieving significant improvements in math and coding tasks without dedicated reward model training.
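
A note on the mechanism: the process rewards need not come from a separately trained PRM. Following the implicit-reward formulation that PRIME builds on, a per-token process reward can be read off as the scaled log-likelihood ratio between the policy and a frozen reference model, so the "PRM" is updated online from ordinary rollouts and outcome labels. A minimal sketch of that computation (the coefficient beta, the tensor values, and the downstream advantage estimator are illustrative assumptions):

```python
import torch

def implicit_process_rewards(policy_logprobs, ref_logprobs, beta=0.05):
    """Per-token implicit process reward: the scaled log-likelihood ratio
    between the policy being trained and a frozen reference model."""
    return beta * (policy_logprobs - ref_logprobs)

# Toy usage on a 5-token rollout (values are made up).
pi_lp = torch.tensor([-0.2, -1.1, -0.4, -2.3, -0.7])   # policy log-probs
ref_lp = torch.tensor([-0.3, -0.9, -0.8, -2.0, -1.5])  # reference log-probs
rewards = implicit_process_rewards(pi_lp, ref_lp)
# Dense rewards like these can feed any advantage estimator before the update.
```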

Authors:Zhiteng Li, Mingyuan Xia, Jingyuan Zhang, Zheng Hui, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang
Title: AdaSVD: Adaptive Singular Value Decomposition for Large Language Models
Abstract:
Large language models (LLMs) have achieved remarkable success in natural language processing (NLP) tasks, yet their substantial memory requirements present significant challenges for deployment on resource-constrained devices. Singular Value Decomposition (SVD) has emerged as a promising compression technique for LLMs, offering considerable reductions in memory overhead. However, existing SVD-based methods often struggle to effectively mitigate the errors introduced by SVD truncation, leading to a noticeable performance gap when compared to the original models. Furthermore, applying a uniform compression ratio across all transformer layers fails to account for the varying importance of different layers. To address these challenges, we propose AdaSVD, an adaptive SVD-based LLM compression approach. Specifically, AdaSVD introduces adaComp, which adaptively compensates for SVD truncation errors by alternately updating the singular matrices $\mathcal{U}$ and $\mathcal{V}^\top$. Additionally, AdaSVD introduces adaCR, which adaptively assigns layer-specific compression ratios based on the relative importance of each layer. Extensive experiments across multiple LLM/VLM families and evaluation metrics demonstrate that AdaSVD consistently outperforms state-of-the-art (SOTA) SVD-based methods, achieving superior performance with significantly reduced memory requirements. Code and models of AdaSVD will be available at https://github.com/ZHITENGLI/AdaSVD.
Chinese: AdaSVD是一种自适应大语言模型压缩方法,通过误差补偿和分层压缩比分配,在显著降低内存占用的同时超越了现有SVD技术的性能表现。
English: AdaSVD is an adaptive LLM compression method that uses error compensation and layer-specific compression ratios to outperform existing SVD techniques while significantly reducing memory usage.
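
To make adaComp concrete: plain truncated SVD is already optimal in the unweighted Frobenius norm, so compensation only has room to help once the reconstruction error is measured on calibration activations. The sketch below alternately refits the two factors against such data; the paper's exact update rules and its adaCR ratio allocation are not reproduced here.

```python
import torch

def adaptive_svd_compress(W, X, rank, n_iters=5):
    """Sketch of truncation-error compensation by alternating updates.

    W: (out, in) weight matrix; X: (in, n_calib) calibration activations.
    Starting from the truncated SVD, the two factors are alternately
    refit to reduce the activation-weighted error ||(W - A @ B) @ X||.
    """
    U, S, Vt = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]                   # (out, rank)
    B = Vt[:rank, :]                             # (rank, in)
    for _ in range(n_iters):
        A = (W @ X) @ torch.linalg.pinv(B @ X)   # fix B, refit A on the data
        B = torch.linalg.pinv(A) @ W             # fix A, refit B
    return A, B

W, X = torch.randn(256, 512), torch.randn(512, 128)
A, B = adaptive_svd_compress(W, X, rank=64)
print((torch.norm((W - A @ B) @ X) / torch.norm(W @ X)).item())
```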

Authors:Oussama Zekri, Nicolas Boullé
Title: Fine-Tuning Discrete Diffusion Models with Policy Gradient Methods
Abstract:
Discrete diffusion models have recently gained significant attention due to their ability to process complex discrete structures for language modeling. However, fine-tuning these models with policy gradient methods, as is commonly done in Reinforcement Learning from Human Feedback (RLHF), remains a challenging task. We propose an efficient, broadly applicable, and theoretically justified policy gradient algorithm, called Score Entropy Policy Optimization (SEPO), for fine-tuning discrete diffusion models over non-differentiable rewards. Our numerical experiments across several discrete generative tasks demonstrate the scalability and efficiency of our method. Our code is available at https://github.com/ozekri/SEPO.
Chinese: 本文提出了评分熵策略优化(SEPO),一种高效且理论可靠的策略梯度算法,用于针对不可微分奖励微调离散扩散模型,并在多个生成任务中验证了其有效性。
English: This paper introduces Score Entropy Policy Optimization (SEPO), an efficient and theoretically grounded policy gradient algorithm for fine-tuning discrete diffusion models with non-differentiable rewards, demonstrating its effectiveness across various generative tasks.

Authors:Ismail Khalfaoui-Hassani, Stefan Kesselheim
Title: Polynomial, trigonometric, and tropical activations
Abstract:
Which functions can be used as activations in deep neural networks? This article explores families of functions based on orthonormal bases, including the Hermite polynomial basis and the Fourier trigonometric basis, as well as a basis resulting from the tropicalization of a polynomial basis. Our study shows that, through simple variance-preserving initialization and without additional clamping mechanisms, these activations can successfully be used to train deep models, such as GPT-2 for next-token prediction on OpenWebText and ConvNeXt for image classification on ImageNet. Our work addresses the issue of exploding and vanishing activations and gradients, particularly prevalent with polynomial activations, and opens the door for improving the efficiency of large-scale learning tasks. Furthermore, our approach provides insight into the structure of neural networks, revealing that networks with polynomial activations can be interpreted as multivariate polynomial mappings. Finally, using Hermite interpolation, we show that our activations can closely approximate classical ones in pre-trained models by matching both the function and its derivative, making them especially useful for fine-tuning tasks. These activations are available in the torchortho library, which can be accessed via: https://github.com/K-H-Ismail/torchortho.
中文摘要:本文研究表明,基于正交基(如埃尔米特多项式和傅里叶基)的激活函数能有效用于深度神经网络训练,不仅解决了梯度爆炸/消失问题,还能通过插值逼近经典激活函数,特别适用于微调任务。
English Summary: This article demonstrates that orthonormal basis functions, such as Hermite polynomials and Fourier bases, can effectively serve as activations in deep neural networks, enabling stable training without special mechanisms while providing insights into network structure and approximation capabilities.
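
The "simple variance-preserving initialization" has a compact illustration: expand the activation in the normalized Hermite basis, which is orthonormal for standard-Gaussian inputs, so coefficients with unit L2 norm keep the output variance near one. A sketch with learnable coefficients follows; the degree and initialization details are assumptions, and this does not reproduce torchortho's actual API.

```python
import torch
import torch.nn as nn

class HermiteActivation(nn.Module):
    """Learnable activation expanded in normalized Hermite polynomials.

    For inputs ~ N(0, 1) the basis He_n(x) / sqrt(n!) is orthonormal, so
    initializing the coefficients with unit L2 norm keeps the output
    variance near 1: the variance-preserving initialization.
    """
    def __init__(self, degree=4):
        super().__init__()
        c = torch.randn(degree + 1)
        self.coeffs = nn.Parameter(c / c.norm())  # sum of c_n^2 = 1 at init

    def forward(self, x):
        # Probabilists' Hermite recurrence:
        # He_0 = 1, He_1 = x, He_{n+1} = x * He_n - n * He_{n-1}
        hs = [torch.ones_like(x), x]
        for n in range(1, len(self.coeffs) - 1):
            hs.append(x * hs[-1] - n * hs[-2])
        out, fact = torch.zeros_like(x), 1.0
        for n in range(len(self.coeffs)):
            if n > 0:
                fact *= n  # fact == n!
            out = out + self.coeffs[n] * hs[n] / fact ** 0.5
        return out

act = HermiteActivation(degree=4)
y = act(torch.randn(8, 16))  # drop-in replacement for a fixed nonlinearity
```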

Authors:Wen Lai, Alexander Fraser, Ivan Titov
Title: Joint Localization and Activation Editing for Low-Resource Fine-Tuning
Abstract:
Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, are commonly used to adapt LLMs. However, the effectiveness of standard PEFT methods is limited in low-resource scenarios with only a few hundred examples. Recent advances in interpretability research have inspired the emergence of activation editing (or steering) techniques, which modify the activations of specific model components. Due to their extremely small parameter counts, these methods show promise for small datasets. However, their performance is highly dependent on identifying the correct modules to edit and often lacks stability across different datasets. In this paper, we propose Joint Localization and Activation Editing (JoLA), a method that jointly learns (1) which heads in the Transformer to edit, (2) whether the intervention should be additive, multiplicative, or both, and (3) the intervention parameters themselves - the vectors applied as additive offsets or multiplicative scalings to the head output. Through evaluations on three benchmarks spanning commonsense reasoning, natural language understanding, and natural language generation, we demonstrate that JoLA consistently outperforms existing methods. The code for the method is released at https://github.com/wenlai-lavine/jola.
Chinese: 提出的JoLA方法联合学习需要编辑的Transformer头部、干预类型(加法/乘法)及干预参数,在有限训练数据下,于多项基准测试中持续超越现有方法。
English: The proposed JoLA method jointly learns which Transformer heads to edit, the type of intervention (additive/multiplicative), and the intervention parameters, consistently outperforming existing methods across multiple benchmarks despite limited training data.
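
A sketch of the per-head intervention being learned: a gate deciding whether each head is edited at all, plus an additive offset and a multiplicative scaling applied to the head's output. The paper learns the gates with a sparsity-inducing relaxation (e.g., hard-concrete); plain sigmoid gates are used below for brevity, and all shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class HeadEdit(nn.Module):
    """Gated additive + multiplicative intervention on per-head outputs.

    Initialized as the identity mapping: zero offsets and scalings leave
    the head output unchanged until training moves them.
    """
    def __init__(self, n_heads, head_dim):
        super().__init__()
        self.gate_add = nn.Parameter(torch.zeros(n_heads))  # which heads get offsets
        self.gate_mul = nn.Parameter(torch.zeros(n_heads))  # which heads get scalings
        self.offset = nn.Parameter(torch.zeros(n_heads, head_dim))
        self.scale = nn.Parameter(torch.zeros(n_heads, head_dim))

    def forward(self, h):  # h: (batch, seq, n_heads, head_dim)
        ga = torch.sigmoid(self.gate_add)[None, None, :, None]
        gm = torch.sigmoid(self.gate_mul)[None, None, :, None]
        h = h * (1 + gm * self.scale)   # multiplicative edit on gated heads
        return h + ga * self.offset     # additive edit on gated heads

edit = HeadEdit(n_heads=12, head_dim=64)
out = edit(torch.randn(2, 10, 12, 64))
```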

Authors:Guanlin Li, Kangjie Chen, Shangwei Guo, Jie Zhang, Han Qiu, Chao Zhang, Guoyin Wang, Tianwei Zhang, Jiwei Li
Title: Picky LLMs and Unreliable RMs: An Empirical Study on Safety Alignment after Instruction Tuning
Abstract:
Large language models (LLMs) have emerged as powerful tools for addressing a wide range of general inquiries and tasks. Despite this, fine-tuning aligned LLMs on smaller, domain-specific datasets, critical to adapting them to specialized tasks, can inadvertently degrade their safety alignment, even when the datasets are benign. This phenomenon makes models more susceptible to providing inappropriate responses. In this study, we systematically examine the factors contributing to safety alignment degradation in benign fine-tuning scenarios. Our analysis identifies three critical factors affecting aligned LLMs: answer structure, identity calibration, and role-play. Additionally, we evaluate the reliability of state-of-the-art reward models (RMs), which are often used to guide alignment processes. Our findings reveal that these RMs frequently fail to accurately reflect human preferences regarding safety, underscoring their limitations in practical applications. By uncovering these challenges, our work highlights the complexities of maintaining safety alignment during fine-tuning and offers guidance to help developers balance utility and safety in LLMs. Datasets and fine-tuning code used in our experiments can be found in https://github.com/GuanlinLee/llm_instruction_tuning.
Chinese: 在领域特定数据集上微调对齐的大型语言模型会削弱其安全对齐性,研究发现三个关键因素及当前奖励模型在保障安全方面的局限性。
English: Fine-tuning aligned large language models on domain-specific datasets can compromise their safety alignment, revealing three key factors and limitations in current reward models for maintaining safety.

Authors:Jiali Cheng, Hadi Amiri
Title: Tool Unlearning for Tool-Augmented LLMs
Abstract:
Tool-augmented large language models (LLMs) are often trained on datasets of query-response pairs, which embed the ability to use tools or APIs directly into the parametric knowledge of LLMs. Tool-augmented LLMs need the ability to forget learned tools due to security vulnerabilities, privacy regulations, or tool deprecations. However, ``tool unlearning'' has not been investigated in unlearning literature. We introduce this novel task, which requires addressing distinct challenges compared to traditional unlearning: knowledge removal rather than forgetting individual samples, the high cost of optimizing LLMs, and the need for principled evaluation metrics. To bridge these gaps, we propose ToolDelete, the first approach for unlearning tools from tool-augmented LLMs. It implements three key properties to address the above challenges for effective tool unlearning and introduces a new membership inference attack (MIA) model for effective evaluation. Extensive experiments on multiple tool learning datasets and tool-augmented LLMs show that ToolDelete effectively unlearns randomly selected tools, while preserving the LLM's knowledge on non-deleted tools and maintaining performance on general tasks.

Authors:Vernon Y. H. Toh, Yew Ken Chia, Deepanway Ghosal, Soujanya Poria
Title: The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles
Abstract:
The releases of OpenAI's o-[n] series, such as o1, o3, and o4-mini, mark a significant paradigm shift in Large Language Models towards advanced reasoning capabilities. Notably, models like o3 have demonstrated strong performance on benchmarks like the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). However, this benchmark is limited to symbolic patterns, whereas humans often perceive and reason about multimodal scenarios involving both vision and language data. Thus, there is an urgent need to investigate advanced reasoning capabilities in multimodal tasks. To this end, we track the evolution of the GPT-[n] and o-[n] series models (including o1, o3, and o4-mini) on challenging multimodal puzzles from PuzzleVQA and AlgoPuzzleVQA, which demand fine-grained visual perception. Our results reveal that the o-[n] series, particularly later iterations like o3 and o4-mini, significantly outperforms the GPT-[n] series and shows strong scalability in multimodal reasoning. Nonetheless, despite these substantial advancements and the superior capabilities demonstrated by the o-[n] series, our findings highlight that even these leading models face persistent challenges. Difficulties are particularly evident in tasks requiring precise visual perception, robust compositional reasoning across multiple visual attributes, and solving complex algorithmic or highly combinatorial puzzles, indicating critical areas for future AGI development. We plan to continuously track new models in the series and update our results in this paper accordingly. All resources used in this evaluation are openly available at https://github.com/declare-lab/LLM-PuzzleTest.
中文: OpenAI的o系列模型在多模态推理能力上显著优于GPT系列,但在需要精确视觉感知和复杂组合推理的任务中仍面临挑战。
English: OpenAI's o-series models demonstrate superior multimodal reasoning capabilities over GPT-series models but still face challenges in tasks requiring precise visual perception and complex compositional reasoning.

Authors:Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim
Title: FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
Abstract:
While large language models (LLMs) excel at handling long-context sequences, they require substantial key-value (KV) caches to store contextual information, which can heavily burden computational efficiency and memory usage. Previous efforts to compress these KV caches primarily focused on reducing memory demands but were limited in enhancing latency. To address this issue, we introduce FastKV, a KV cache compression method designed to reduce latency for long-context inference. FastKV improves processing speed while preserving accuracy by adopting Token-Selective Propagation (TSP). This approach preserves full-context information in early layers of LLMs and selectively propagates only a portion of this information in later layers. This design enables FastKV to minimize redundant computation without sacrificing contextual fidelity. Our experimental results show that FastKV achieves up to 1.97$\times$ and 4.82$\times$ improvements in time-to-first-token (TTFT) and throughput, respectively, compared to baseline without KV cache compression. Moreover, FastKV successfully maintains accuracy within 1\% of the baseline on long-context benchmarks. Our code is available at https://github.com/dongwonjo/FastKV.
中文: FastKV是一种KV缓存压缩方法,通过令牌选择性传播技术,在保持长上下文基准测试准确率接近基线1%以内的同时,显著提升了首个令牌生成时间和系统吞吐量。
English: FastKV is a KV cache compression method that uses Token-Selective Propagation to significantly reduce latency and improve throughput in long-context LLM inference while maintaining accuracy within 1% of baseline performance.
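
Token-Selective Propagation can be sketched as: at a designated layer, score the tokens and forward only the top fraction to all later layers, while layers before the split keep the full context. The scoring rule below (total attention a token receives) is a plausible stand-in rather than the paper's exact criterion.

```python
import torch

def token_selective_propagation(hidden, attn_weights, keep_ratio=0.25):
    """Select which tokens later layers continue to process.

    hidden:       (batch, seq, dim) hidden states at the TSP layer.
    attn_weights: (batch, heads, seq, seq) attention at that layer.
    Tokens are scored by the total attention they receive; the top
    fraction is kept, in original order.
    """
    scores = attn_weights.sum(dim=(1, 2))              # (batch, seq)
    k = max(1, int(hidden.size(1) * keep_ratio))
    idx = scores.topk(k, dim=-1).indices.sort(dim=-1).values
    batch = torch.arange(hidden.size(0)).unsqueeze(-1)
    return hidden[batch, idx], idx                     # pruned states + positions

h = torch.randn(1, 128, 256)
a = torch.softmax(torch.randn(1, 8, 128, 128), dim=-1)
h_kept, kept_idx = token_selective_propagation(h, a)
print(h_kept.shape)  # torch.Size([1, 32, 256])
```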

Authors:Ehsaneddin Asgari, Yassine El Kheir, Mohammad Ali Sadraei Javaheri
Title: MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies
Abstract:
Tokenization is fundamental to Natural Language Processing (NLP), directly impacting model efficiency and linguistic fidelity. While Byte Pair Encoding (BPE) is widely used in Large Language Models (LLMs), it often disregards morpheme boundaries, leading to suboptimal segmentation, particularly in morphologically rich languages. We introduce MorphBPE, a morphology-aware extension of BPE that integrates linguistic structure into subword tokenization while preserving statistical efficiency. Additionally, we propose two morphology-based evaluation metrics: (i) Morphological Consistency F1-Score, which quantifies the consistency between morpheme sharing and token sharing, a property linked to LLM training convergence, and (ii) Morphological Edit Distance, which measures the alignment between morphemes and tokens as a proxy for interpretability. Experiments on English, Russian, Hungarian, and Arabic across 300M and 1B parameter LLMs demonstrate that MorphBPE consistently reduces cross-entropy loss, accelerates convergence, and improves morphological alignment scores. Fully compatible with existing LLM pipelines, MorphBPE requires minimal modifications for integration. The MorphBPE codebase and tokenizer playground will be available at: https://github.com/llm-lab-org/MorphBPE and https://tokenizer.llm-lab.org
中文:MorphBPE是一种融合形态学结构的BPE分词改进方法,通过引入形态一致性评估指标,在提升大语言模型分词效果和训练收敛速度的同时保持与现有流程的完全兼容。
English: MorphBPE enhances BPE tokenization by incorporating morphological awareness, improving linguistic fidelity and model efficiency in large language models while introducing novel evaluation metrics for better segmentation and interpretability.
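
The core constraint is easiest to see in a toy BPE trainer where pair counting and merging are restricted to within morphemes, so no learned token ever crosses a morpheme boundary. A minimal sketch, assuming a morphological segmenter has already split each word; frequency weighting and vocabulary management are elided.

```python
from collections import Counter

def morph_bpe_merges(corpus, n_merges):
    """Learn BPE merges that never cross morpheme boundaries.

    corpus: list of words, each a list of morpheme strings,
            e.g. [["un", "break", "able"], ["break", "ing"]].
    """
    # Each morpheme starts as a list of characters; merges stay inside it.
    words = [[list(m) for m in w] for w in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for w in words:
            for morph in w:
                for a, b in zip(morph, morph[1:]):
                    pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for w in words:
            for morph in w:
                i = 0
                while i < len(morph) - 1:
                    if morph[i] == a and morph[i + 1] == b:
                        morph[i:i + 2] = [a + b]
                    else:
                        i += 1
    return merges

print(morph_bpe_merges([["un", "break", "able"], ["break", "ing"]], 3))
```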

Authors:Teng Xiao, Yige Yuan, Zhengyu Chen, Mingxiao Li, Shangsong Liang, Zhaochun Ren, Vasant G Honavar
Title: SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters
Abstract:
Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches, even without any hyperparameters or a reference model. For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: https://github.com/tengxiao1/SimPER.
Chinese: 提出的SimPER算法采用无需超参数调优的偏好优化方法,通过逆困惑度对齐语言模型,在多个基准测试中无需参考模型即实现卓越性能。
English: The proposed SimPER algorithm introduces a hyperparameter-free preference optimization method that uses inverse perplexity to align language models, achieving superior performance across multiple benchmarks without requiring costly tuning or reference models.
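
The objective is compact enough to state directly: compute the inverse perplexity exp(mean token log-likelihood) of the chosen and the rejected response under the policy, then push the former up and the latter down. No reference model or tunable coefficient appears. A sketch, assuming per-token log-probs and padding masks are already gathered:

```python
import torch

def simper_loss(chosen_lp, chosen_mask, rejected_lp, rejected_mask):
    """Hyperparameter-free preference loss from inverse perplexity.

    *_lp:   (batch, seq) per-token log-probs under the policy.
    *_mask: (batch, seq) 1.0 on response tokens, 0.0 on padding.
    """
    def inv_ppl(lp, mask):
        mean_ll = (lp * mask).sum(-1) / mask.sum(-1)
        return torch.exp(mean_ll)  # inverse perplexity, in (0, 1]

    # Raise inverse perplexity of chosen responses, lower it for rejected.
    return (-inv_ppl(chosen_lp, chosen_mask)
            + inv_ppl(rejected_lp, rejected_mask)).mean()

loss = simper_loss(torch.randn(4, 16).abs().neg(), torch.ones(4, 16),
                   torch.randn(4, 16).abs().neg(), torch.ones(4, 16))
```

Since only the policy's own log-probabilities are needed, the reference-model forward pass of DPO-style objectives disappears, which is where the stated compute and memory savings come from.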

Authors:Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Title: BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts
Abstract:
Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy in these techniques crucially depend on the criteria used to make exit decisions. We propose BEEM, a new decision criterion in which exit classifiers are treated as experts and their confidence scores are aggregated. The confidence scores are aggregated only if neighbouring experts are consistent in their predictions as a sample passes through them, thus capturing their ensemble effect. A sample exits when the aggregated confidence value exceeds a threshold. The threshold is set using the error rates of the intermediate exits, aiming to surpass the performance of conventional DNN inference. Experimental results on the COCO dataset for image captioning and GLUE datasets for various language tasks demonstrate that our method enhances the performance of state-of-the-art EE methods, achieving speed-ups of 1.5x to 2.1x. Compared to final-layer inference, accuracy is comparable on the harder image captioning task and improves on the easier language tasks. The source code for this work is publicly available at https://github.com/Div290/BEEM1/tree/main
中文: BEEM提出通过聚合相邻分类器在预测一致时的置信度作为早退新标准,在图像描述和语言任务中实现1.5-2.1倍加速,同时保持或提升模型精度。
English: BEEM introduces a novel early exit criterion that aggregates confidence scores from neighboring classifiers when predictions are consistent, achieving 1.5x-2.1x speed-up while maintaining comparable or improved accuracy across vision and language tasks.
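
For one sample, the exit rule can be sketched as: confidence accumulates across consecutive exits that agree on the prediction (the ensemble effect), resets on disagreement, and triggers an exit once it clears the threshold at that depth. Per the abstract, thresholds are assumed to be set offline from intermediate-exit error rates.

```python
def beem_exit(exit_probs, thresholds):
    """Early-exit decision treating exit classifiers as experts.

    exit_probs: per-exit class-probability lists, in layer order, for one
    sample. Confidence accumulates while consecutive exits agree on the
    predicted label and resets on disagreement.
    """
    agg, prev_label = 0.0, None
    for i, probs in enumerate(exit_probs):
        label = max(range(len(probs)), key=probs.__getitem__)
        conf = probs[label]
        agg = agg + conf if label == prev_label else conf
        prev_label = label
        if agg > thresholds[i]:
            return i, label                  # exit layer and prediction
    return len(exit_probs) - 1, prev_label   # fall through to the final exit

print(beem_exit([[0.5, 0.5], [0.7, 0.3], [0.8, 0.2]], [1.2, 1.2, 1.2]))  # (2, 0)
```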

Authors:Hyeong Kyu Choi, Maxim Khanov, Hongxin Wei, Yixuan Li
Title: How Contaminated Is Your Benchmark? Quantifying Dataset Leakage in Large Language Models with Kernel Divergence
Abstract:
Dataset contamination, where evaluation datasets overlap with pre-training corpora, inflates performance metrics and undermines the reliability of model evaluations. Measuring dataset contamination thus becomes essential to ensure that performance evaluations genuinely reflect a model's ability to generalize to unseen data, rather than relying on memorized examples. To address this problem, we propose Kernel Divergence Score (KDS), a novel method that evaluates dataset contamination by computing the divergence between the kernel similarity matrices of sample embeddings before and after fine-tuning on the benchmark dataset. Leveraging the insight that fine-tuning affects unseen samples more significantly than seen ones, KDS provides a reliable measure of contamination. Through extensive experiments on controlled contamination scenarios, KDS demonstrates a near-perfect correlation with contamination levels and outperforms existing baselines. Additionally, we perform comprehensive ablation studies to analyze the impact of key design choices, providing deeper insights into the components and effectiveness of KDS. These ablations highlight the importance of leveraging fine-grained kernel-based information and confirm the reliability of the proposed framework across diverse datasets and settings. Code is released at https://github.com/deeplearning-wisc/kernel-divergence-score.
中文: 提出的核散度评分(KDS)通过比较微调前后样本嵌入的核相似性矩阵,有效评估数据集污染问题,在受控实验中展现出与污染程度近乎完美的相关性,并优于现有基线方法。
English: The proposed Kernel Divergence Score (KDS) effectively measures dataset contamination by comparing kernel similarity matrices of sample embeddings before and after fine-tuning, demonstrating superior correlation with contamination levels and outperforming existing methods in controlled experiments.
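
A sketch of the score: build kernel similarity matrices over the benchmark samples' embeddings before and after fine-tuning, then measure how far the structure moved; the intuition is that fine-tuning reshapes unseen samples more than memorized ones. The RBF kernel and row-wise KL divergence below are illustrative choices, not necessarily the paper's exact ones.

```python
import torch

def kernel_divergence_score(emb_before, emb_after, gamma=1.0):
    """Divergence between kernel similarity structures of the same
    benchmark samples before vs. after fine-tuning on that benchmark.

    emb_*: (n, d) sample embeddings from the two model checkpoints.
    """
    def rbf_rows(E):
        K = torch.exp(-gamma * torch.cdist(E, E).pow(2))
        return K / K.sum(dim=1, keepdim=True)  # each row a distribution

    P, Q = rbf_rows(emb_before), rbf_rows(emb_after)
    # Mean row-wise KL divergence between the two kernel matrices.
    return (P * (P.clamp_min(1e-12) / Q.clamp_min(1e-12)).log()).sum(1).mean()

score = kernel_divergence_score(torch.randn(100, 32), torch.randn(100, 32))
print(score.item())
```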

Authors:Donglei Yu, Yang Zhao, Jie Zhu, Yangyifan Xu, Yu Zhou, Chengqing Zong
Title: SimulPL: Aligning Human Preferences in Simultaneous Machine Translation
Abstract:
Simultaneous Machine Translation (SiMT) generates translations while receiving streaming source inputs. This requires the SiMT model to learn a read/write policy, deciding when to translate and when to wait for more source input. Numerous linguistic studies indicate that audiences in SiMT scenarios have distinct preferences, such as accurate translations, simpler syntax, and no unnecessary latency. Aligning SiMT models with these human preferences is crucial to improving their performance. However, this issue remains unexplored. Additionally, preference optimization for the SiMT task is challenging. Existing methods focus solely on optimizing the generated responses, ignoring human preferences related to latency and the optimization of the read/write policy during the preference optimization phase. To address these challenges, we propose Simultaneous Preference Learning (SimulPL), a preference learning framework tailored for the SiMT task. In the SimulPL framework, we categorize SiMT human preferences into five aspects: \textbf{translation quality preference}, \textbf{monotonicity preference}, \textbf{key point preference}, \textbf{simplicity preference}, and \textbf{latency preference}. By leveraging the first four preferences, we construct human preference prompts to efficiently guide GPT-4/4o in generating preference data for the SiMT task. In the preference optimization phase, SimulPL integrates \textbf{latency preference} into the optimization objective and enables SiMT models to improve the read/write policy, thereby aligning with human preferences more effectively. Experimental results indicate that SimulPL exhibits better alignment with human preferences across all latency levels in Zh$\rightarrow$En, De$\rightarrow$En and En$\rightarrow$Zh SiMT tasks. Our data and code will be available at https://github.com/EurekaForNLP/SimulPL.
中文: 同步机器翻译需要模型在接收源输入时平衡翻译决策,而提出的SimulPL框架通过整合翻译质量、单调性、关键点、简洁性和延迟这五类人类偏好,有效优化了多语言任务中的表现和一致性。
English: Simultaneous Machine Translation requires models to balance translation decisions with input reception, and the proposed SimulPL framework addresses this by incorporating five human preferences—translation quality, monotonicity, key points, simplicity, and latency—to optimize performance and alignment in various language tasks.

Authors:Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, Zexue He
Title: M+: Extending MemoryLLM with Scalable Long-Term Memory
Abstract:
Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead. We open-source our code at https://github.com/wangyu-ustc/MemoryLLM
Chinese: M+在MemoryLLM基础上引入长期记忆机制和联合训练的检索器,将知识保留能力从不足2万标记扩展到超过16万标记,同时保持相近的GPU内存开销。
English: M+ enhances MemoryLLM by integrating a long-term memory mechanism and a co-trained retriever, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory usage.

Authors:Yizhe Xiong, Wei Huang, Xin Ye, Hui Chen, Zijia Lin, Haoran Lian, Zhenpeng Su, Jungong Han, Guiguang Ding
Title: UniAttn: Reducing Inference Costs via Softmax Unification for Post-Training LLMs
Abstract:
Post-training is essential for adapting Large Language Models (LLMs) to real-world applications. Deploying post-trained models faces significant challenges due to substantial memory overhead and noticeable inference latency. Existing work has identified significant redundancies in LLMs and proposed efficient architectures, namely intra-layer KV sharing and cross-layer KV sharing. However, intra-layer KV sharing still results in high inference costs, while cross-layer KV sharing leads to significant performance degradation. As a result, both methods remain suboptimal for post-training pre-trained LLMs. In this paper, we identify that the \texttt{Softmax} operation is a primary bottleneck for LLM inference and discover that it is actually highly redundant during post-training. We propose Softmax \textbf{Uni}fication in \textbf{Att}e\textbf{n}tion (\textbf{UniAttn}), a novel post-training method that unifies Softmax activations across transformer blocks to reduce LLM inference costs. Additionally, UniAttn adopts a linear projection to compensate for the errors induced by Softmax unification. Experiments show that UniAttn matches the performance of standard post-training while significantly reducing inference costs, outperforming existing efficient architectures during post-training. Our code will be available at \url{https://github.com/Bostoncake/UniAttn}.
中文摘要:针对大型语言模型(LLM)后训练在现实应用中面临的高内存开销和推理延迟问题,UniAttn方法通过统一Transformer块中的Softmax操作,在保持性能的同时显著降低了推理成本。
English Summary: Post-training large language models (LLMs) for real-world use faces challenges like high memory usage and slow inference, which the proposed UniAttn method addresses by unifying Softmax operations across transformer blocks to cut costs without sacrificing performance.

Authors:Turi Abu, Ying Shi, Thomas Fang Zheng, Dong Wang
Title: Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language
Abstract:
We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the Oromo language, which is underrepresented. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and a WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both the challenges and the potential for improving ASR performance in Oromo. The dataset is publicly available at https://github.com/turinaf/sagalee and we encourage its use for further research and development in Oromo speech processing.
中文: 本研究通过众包收集了100小时的奥罗莫语语音数据集,以解决该语言在自动语音识别中的资源匮乏问题,实验显示通过Conformer和Whisper模型微调可将词错误率优化至10.82%,为奥罗莫语语音处理建立了首批基准性能指标。
English: This study introduces a novel 100-hour Oromo speech dataset collected via crowd-sourcing to address the language's underrepresentation in ASR, demonstrating through Conformer and Whisper model experiments that fine-tuning achieves a competitive 10.82% WER and establishing initial benchmarks for Oromo speech processing.

Authors:Xin Xu, Qiyun Xu, Tong Xiao, Tianhao Chen, Yuchen Yan, Jiaxin Zhang, Shizhe Diao, Can Yang, Yang Wang
Title: UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks, particularly in mathematics. However, the domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics, underscoring the need for a comprehensive evaluation. To fill this gap, we introduce UGPhysics, a large-scale and comprehensive benchmark specifically designed to evaluate UnderGraduate-level Physics (UGPhysics) reasoning with LLMs. UGPhysics includes 5,520 undergraduate-level physics problems in both English and Chinese, covering 13 subjects with seven different answer types and four distinct physics reasoning skills, all rigorously screened for data leakage. Additionally, we develop a Model-Assistant Rule-based Judgment (MARJ) pipeline specifically tailored for assessing answer correctness of physics problems, ensuring accurate evaluation. Our evaluation of 31 leading LLMs shows that the highest overall accuracy, 49.8% (achieved by OpenAI-o1-mini), emphasizes the necessity for models with stronger physics reasoning skills, beyond math abilities. We hope UGPhysics, along with MARJ, will drive future advancements in AI for physics reasoning. Codes and data are available at https://github.com/YangLabHKUST/UGPhysics .
中文:UGPhysics是一个专为评估大语言模型在本科物理推理能力而设计的大规模基准测试,通过揭示现有模型的显著性能差距并引入定制化评估方法,旨在推动人工智能在物理推理领域的未来发展。
English: UGPhysics is a comprehensive benchmark designed to evaluate undergraduate-level physics reasoning in large language models, revealing significant performance gaps and introducing a specialized assessment pipeline to advance AI capabilities in this domain.

Authors:Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara
Title: SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition
Abstract:
In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filter, and coefficient denoising, our approach introduces a learnable model for both wavelet bases and denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet coefficients. Our approach exploits the capabilities of wavelets for effective localization in both time and frequency domains. We then combine one-dimensional dilated convolutional neural networks (1D dilated CNN) with a spatial attention layer and bidirectional gated recurrent units (Bi-GRU) with a temporal attention layer to efficiently capture the nuanced spatial and temporal characteristics of emotional features. By handling variable-length speech without segmentation and eliminating the need for pre- or post-processing, the proposed model outperformed state-of-the-art methods on the IEMOCAP and EMO-DB datasets. The source code of this paper is shared on the Github repository: https://github.com/alaaNfissi/SigWavNet-Learning-Multiresolution-Signal-Wavelet-Network-for-Speech-Emotion-Recognition.
Chinese: 本文提出了一种新颖的端到端深度学习框架,通过结合小波变换和注意力神经网络直接处理原始语音信号,在无需分段或预处理的情况下,在基准数据集上实现了优于现有方法的语音情感识别性能。
English: This paper presents a novel end-to-end deep learning framework for speech emotion recognition that leverages wavelet transforms and attention-based neural networks to directly process raw speech signals, achieving superior performance on benchmark datasets without requiring segmentation or preprocessing.

Authors:Abdurrahim Yilmaz, Furkan Yuceyalcin, Ece Gokyayla, Donghee Choi, Ozan Erdem, Ali Anil Demircali, Rahmetullah Varol, Ufuk Gorkem Kirabali, Gulsum Gencoglan, Joram M. Posma, Burak Temelkuran
Title: DermaSynth: Rich Synthetic Image-Text Pairs Using Open Access Dermatology Datasets
Abstract:
A major barrier to developing vision large language models (LLMs) in dermatology is the lack of large image--text pairs dataset. We introduce DermaSynth, a dataset comprising 92,020 synthetic image--text pairs curated from 45,205 images (13,568 clinical and 35,561 dermatoscopic) for dermatology-related clinical tasks. Leveraging a state-of-the-art LLM, Gemini 2.0, we used clinically related prompts and the self-instruct method to generate diverse and rich synthetic texts. Metadata from the datasets were incorporated into the input prompts to reduce potential hallucinations. The resulting dataset builds upon open access dermatological image repositories (DERM12345, BCN20000, PAD-UFES-20, SCIN, and HIBA) that have permissive CC-BY-4.0 licenses. We also fine-tuned a preliminary Llama-3.2-11B-Vision-Instruct model, DermatoLlama 1.0, on 5,000 samples. We anticipate this dataset to support and accelerate AI research in dermatology. Data and code underlying this work are accessible at https://github.com/abdurrahimyilmaz/DermaSynth.
中文: 开发皮肤病学视觉大模型的主要障碍是缺乏大型图像-文本配对数据集,为此我们推出了DermaSynth,这是一个包含92,020对合成图像-文本的数据集,利用先进大语言模型和临床提示生成,旨在支持和加速皮肤病学人工智能研究。
English: The main challenge in developing vision large language models for dermatology is the scarcity of large image-text datasets, which is addressed by the introduction of DermaSynth, a comprehensive synthetic dataset of 92,020 image-text pairs generated using advanced LLMs and clinical prompts to support AI research in the field.

Authors:Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
Title: s1: Simple test-time scaling
Abstract:
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
中文摘要:本研究通过构建精选数据集和预算强制技术,提出了一种简单的测试时扩展方法,使Qwen2.5-32B模型在数学推理任务上超越OpenAI的o1-preview,同时保持完全开源。
English Summary: This study introduces a simple test-time scaling method using a curated dataset and budget forcing technique, enabling the Qwen2.5-32B model to outperform OpenAI's o1-preview on math reasoning tasks while being fully open-source.
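
Budget forcing reduces to a small control loop around decoding: if the model tries to close its reasoning before a minimum budget, the end-of-thinking delimiter is replaced with "Wait" so it keeps thinking (often double-checking its answer); if it exceeds the maximum budget, thinking is forcibly terminated. A sketch with a hypothetical generate_step interface and word counts standing in for token counts:

```python
def budget_forced_trace(generate_step, min_tokens, max_tokens,
                        end_think="</think>", wait=" Wait"):
    """Sketch of budget forcing around a step-wise decoder.

    generate_step(trace) -> next chunk of the reasoning trace
    (hypothetical interface). Word counts stand in for token counts.
    """
    trace = ""
    while len(trace.split()) < max_tokens:
        chunk = generate_step(trace)
        if end_think in chunk and len(trace.split()) < min_tokens:
            # Lengthen: suppress the stop marker and nudge more thinking.
            trace += chunk.replace(end_think, wait)
            continue
        trace += chunk
        if end_think in chunk:
            break                      # model stopped within budget
    else:
        trace += end_think             # shorten: force termination
    return trace
```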

Authors:Xingyou Song, Dara Bahri
Title: Decoding-based Regression
Abstract:
Language models have recently been shown capable of performing regression wherein numeric predictions are represented as decoded strings. In this work, we provide theoretical grounds for this capability and furthermore investigate the utility of causal sequence decoding models as numeric regression heads given any feature representation. We find that, despite being trained in the usual way - for next-token prediction via cross-entropy loss - decoder-based heads are as performant as standard pointwise heads when benchmarked over standard regression tasks, while being flexible enough to capture smooth numeric distributions, such as in the task of density estimation.
中文摘要:语言模型能够通过将数值预测解码为字符串来有效执行回归任务,其中基于解码器的回归头在性能上与标准方法相当,同时能更好地捕捉平滑数值分布,如密度估计。
English Summary: Language models can effectively perform numeric regression by decoding predictions as strings, with decoder-based heads matching the performance of standard regression methods while offering greater flexibility for tasks like density estimation.

Authors:Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, Caiming Xiong
Title: Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Abstract:
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD synergistically combines a lightweight draft model with a more powerful target model, incorporating a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD employs a process reward model to evaluate intermediate decoding steps and dynamically decide whether to invoke the target model, optimizing the trade-off between computational cost and output quality. We theoretically demonstrate that a threshold-based mixture strategy achieves an optimal balance between resource utilization and performance. Extensive evaluations on challenging reasoning benchmarks, including Olympiad-level tasks, show that RSD delivers significant efficiency gains over decoding with the target model only (up to 4.4x fewer FLOPs), while achieving significantly better accuracy than parallel decoding methods on average (up to +3.5). These results highlight RSD as a robust and cost-effective approach for deploying LLMs in resource-intensive scenarios. The code is available at https://github.com/BaohaoLiao/RSD.
Chinese: RSD是一种新颖的推测解码框架,通过结合轻量级草稿模型与强大目标模型,并利用过程奖励模型动态优化计算成本与输出质量,显著提升大语言模型推理效率,最高减少4.4倍计算量且提高准确率。
English: RSD is a novel speculative decoding framework that enhances LLM inference efficiency by integrating a lightweight draft model with a powerful target model and using a process reward model to dynamically optimize computational cost and output quality, achieving up to 4.4x fewer FLOPs and improved accuracy.
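
The threshold-based mixture can be sketched as a per-step loop: the lightweight draft model proposes a reasoning step, the process reward model scores it, and the target model is invoked only when the score falls below the threshold. Here draft_step, target_step, reward, and the <eos> convention are hypothetical interfaces.

```python
def reward_guided_decode(draft_step, target_step, reward,
                         threshold=0.7, max_steps=64):
    """Threshold-based mixture of a draft and a target model.

    draft_step / target_step: prefix -> next reasoning step (hypothetical
    interfaces for the small and large model).
    reward(prefix, step) -> float from a process reward model.
    """
    prefix = ""
    for _ in range(max_steps):
        step = draft_step(prefix)              # cheap proposal
        if reward(prefix, step) < threshold:   # PRM flags a weak step
            step = target_step(prefix)         # regenerate with the target model
        prefix += step
        if step.endswith("<eos>"):
            break
    return prefix
```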

Authors:Nafis Irtiza Tripto, Saranya Venkatraman, Mahjabin Nahar, Dongwon Lee
Title: Beyond checkmate: exploring the creative chokepoints in AI text
Abstract:
The rapid advancement of Large Language Models (LLMs) has revolutionized text generation but also raised concerns about potential misuse, making detecting LLM-generated text (AI text) increasingly essential. While prior work has focused on identifying AI text and effectively checkmating it, our study investigates a less-explored territory: portraying the nuanced distinctions between human and AI texts across text segments (introduction, body, and conclusion). Whether LLMs excel or falter at incorporating linguistic ingenuity across text segments critically informs their viability and limits as effective creative assistants to humans. Through an analogy with the structure of chess games, comprising opening, middle, and end games, we analyze segment-specific patterns to reveal where the most striking differences lie. Although AI texts closely resemble human writing in the body segment due to its length, deeper analysis shows a higher divergence in features dependent on the continuous flow of language, making it the most informative segment for detection. Additionally, human texts exhibit greater stylistic variation across segments, offering a new lens for distinguishing them from AI. Overall, our findings provide fresh insights into human-AI text differences and pave the way for more effective and interpretable detection strategies. Code available at https://github.com/tripto03/chess_inspired_human_ai_text_distinction.
中文摘要:本研究揭示,尽管AI生成文本在主体段落与人类写作高度相似,但在需要语言连贯性的特征上差异显著,且人类文本在不同段落间风格变化更大,为检测提供了新视角。
English Summary: This study reveals that while AI-generated text closely mimics human writing in body segments, it shows greater divergence in features requiring continuous language flow, and human texts display more stylistic variation across different segments, offering new insights for detection.

Authors:Shumin Que, Anton Ragni
Title: VisualSpeech: Enhancing Prosody Modeling in TTS Using Video
Abstract:
Text-to-Speech (TTS) synthesis faces the inherent challenge of producing multiple speech outputs with varying prosody given a single text input. While previous research has addressed this by predicting prosodic information from both text and speech, additional contextual information, such as video, remains under-utilized despite being available in many applications. This paper investigates the potential of integrating visual context to enhance prosody prediction. We propose a novel model, VisualSpeech, which incorporates visual and textual information for improving prosody generation in TTS. Empirical results indicate that incorporating visual features improves prosodic modeling, enhancing the expressiveness of the synthesized speech. Audio samples are available at https://ariameetgit.github.io/VISUALSPEECH-SAMPLES/.

Authors:Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
Title: Memory-Efficient Fine-Tuning of Transformers via Token Selection
Abstract:
Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs high memory overhead, especially for large transformer-based models, such as LLMs. While existing methods may reduce certain parts of the memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations are cached during the forward pass. Also, TokenTune can be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large transformers, in specializing them for specific domains or co-training them with other neural components from a larger system. Our code is available at https://github.com/facebookresearch/tokentune.
Chinese: TokenTune是一种针对Transformer模型的高效微调方法,通过在反向传播中仅处理部分输入标记来减少激活内存占用,在保持与全参数微调相当性能的同时显著降低了内存消耗。
English: TokenTune is a memory-efficient fine-tuning method for transformer models that reduces activation memory by processing only a subset of tokens during backward passes, achieving comparable performance to full fine-tuning while significantly lowering memory usage.
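
Conceptually, the selection can be expressed with a detach: gradients flow only through a random subset of token positions, so only those positions' intermediate activations need to be cached for the backward pass. A conceptual sketch; realizing the actual memory savings requires wiring this into the forward pass so that non-selected activations are freed.

```python
import torch

def token_select_states(hidden, k):
    """Keep gradients for k random token positions; detach the rest.

    hidden: (batch, seq, dim) activations inside a transformer block.
    Detached positions become constants for autograd, so their
    intermediate activations need not be cached for the backward pass.
    """
    seq = hidden.size(1)
    idx = torch.randperm(seq)[:k]
    mask = torch.zeros(1, seq, 1, dtype=torch.bool, device=hidden.device)
    mask[:, idx] = True
    return torch.where(mask, hidden, hidden.detach())

h = torch.randn(2, 128, 64, requires_grad=True)
h_mixed = token_select_states(h, k=32)
h_mixed.sum().backward()                          # gradient reaches 32 positions
print((h.grad.abs().sum(dim=(0, 2)) > 0).sum())   # tensor(32)
```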

Authors:Daniel Schwartz, Dmitriy Bespalov, Zhe Wang, Ninad Kulkarni, Yanjun Qi
Title: Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
Abstract:
As large language models (LLMs) become increasingly prevalent, ensuring their robustness against adversarial misuse is crucial. This paper introduces the GAP (Graph of Attacks with Pruning) framework, an advanced approach for generating stealthy jailbreak prompts to evaluate and enhance LLM safeguards. GAP addresses limitations in existing tree-based LLM jailbreak methods by implementing an interconnected graph structure that enables knowledge sharing across attack paths. Our experimental evaluation demonstrates GAP's superiority over existing techniques, achieving a 20.8% increase in attack success rates while reducing query costs by 62.7%. GAP consistently outperforms state-of-the-art methods for attacking both open and closed LLMs, with attack success rates of >96%. Additionally, we present specialized variants like GAP-Auto for automated seed generation and GAP-VLM for multimodal attacks. GAP-generated prompts prove highly effective in improving content moderation systems, increasing true positive detection rates by 108.5% and accuracy by 183.6% when used for fine-tuning. Our implementation is available at https://github.com/dsbuddy/GAP-LLM-Safety.
中文: 本文提出的GAP框架采用互联图结构生成隐蔽的越狱提示,在将查询成本降低62.7%的同时使攻击成功率提升20.8%,用于微调时还能将内容审核系统的检测准确率提升183.6%。
English: This paper presents the GAP framework, which uses an interconnected graph structure to generate stealthy jailbreak prompts, significantly improving attack success rates by 20.8% while reducing query costs by 62.7% and enhancing content moderation systems when used for fine-tuning.

Authors:Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal
Title: Differentially Private Steering for Large Language Model Alignment
Abstract:
Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the Private Steering for LLM Alignment (PSA) algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (LlaMa, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our experiments support the theoretical guarantees by showing improved guarantees for our PSA algorithm compared to several existing non-private techniques.
中文摘要:本研究提出具有差分隐私保证的私有引导对齐算法(PSA),通过激活编辑在保护私有数据集的同时对齐大语言模型,在七个基准测试中以最小性能损失实现有效对齐。
English Summary: This study introduces the Private Steering for LLM Alignment (PSA) algorithm, which uses differentially private activation editing to align large language models with private datasets while minimizing performance loss across seven benchmarks.
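
A hedged sketch of what differentially private steering involves: the steering vector is the difference between mean activations on positive and negative private demonstrations, with each sample's contribution clipped to bound sensitivity and Gaussian noise added (the Gaussian mechanism). Calibrating sigma to a target (epsilon, delta) and the paper's exact estimator are omitted; all parameter values are illustrative.

```python
import torch

def dp_steering_vector(pos_acts, neg_acts, clip=1.0, sigma=0.5):
    """Differentially private steering vector (Gaussian mechanism sketch).

    pos_acts / neg_acts: (n, d) activations from private positive and
    negative demonstrations at one layer.
    """
    def clipped_mean(acts):
        norms = acts.norm(dim=1, keepdim=True).clamp_min(1e-12)
        return (acts * (clip / norms).clamp(max=1.0)).mean(0)

    v = clipped_mean(pos_acts) - clipped_mean(neg_acts)
    # One sample changes either mean by at most 2 * clip / n (L2 sensitivity).
    noise_scale = sigma * 2 * clip / min(len(pos_acts), len(neg_acts))
    return v + noise_scale * torch.randn_like(v)

vec = dp_steering_vector(torch.randn(64, 512), torch.randn(64, 512))
```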

Authors:Benjamin Feuer, Chinmay Hegde
Title: WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Abstract:
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples and code are available at https://github.com/penfever/wildchat-50m.
中文:WILDCHAT-50M作为最大公开聊天数据集的推出,支持对语言模型后训练技术进行广泛比较分析,并证明新SFT混合方法能以更少样本实现更优性能。
English: The introduction of WILDCHAT-50M, the largest public chat dataset, enables extensive comparative analysis of language model post-training techniques and demonstrates superior performance with a new SFT mixture using significantly fewer samples.

Authors:Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
Title: MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
Abstract:
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models. Code and data are available at: https://github.com/TsinghuaC3I/MedXpertQA
中文:MedXpertQA是一个包含4,460道跨17个专科医学难题的权威评测基准,通过严格筛选和多模态临床数据,专门评估超越传统问答的高级推理能力。
English: MedXpertQA is a challenging medical benchmark featuring 4,460 expert-level questions across 17 specialties, incorporating rigorous filtering and multimodal clinical data to evaluate advanced reasoning beyond traditional QA pairs.

Authors:Yibo Wang, Tiansheng Huang, Li Shen, Huanjin Yao, Haotian Luo, Rui Liu, Naiqiang Tan, Jiaxing Huang, Dacheng Tao
Title: Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation
Abstract:
Harmful fine-tuning attack introduces significant security risks to the fine-tuning services. Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile -- with a few fine-tuning steps, the model still can learn the harmful knowledge. To this end, we conduct further experiments and find that an embarrassingly simple solution -- adding purely random perturbations to the fine-tuned model, can recover the model from harmful behavior, though it leads to a degradation in the model's fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up to 21.5%, while maintaining fine-tuning performance. As a by-product, we analyze the optimized perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at https://github.com/w-yibo/Panacea
中文: 研究发现现有针对有害微调攻击的防御措施效果不佳,但简单的随机扰动方法可降低风险,尽管会牺牲微调性能,因此提出Panacea方案,通过自适应扰动在保持安全的同时不影响性能。
English: The study reveals that current defenses against harmful fine-tuning attacks are ineffective, but a simple random perturbation method can mitigate risks, though it compromises fine-tuning performance, leading to the development of Panacea, which uses adaptive perturbations to maintain safety without performance loss.

Authors:Kumar Ashutosh, Yossi Gandelsman, Xinlei Chen, Ishan Misra, Rohit Girdhar
Title: LLMs can see and hear without any training
Abstract:
We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach, to imbue multimodal capabilities into your favorite LLM. Leveraging their innate ability to perform multi-step reasoning, MILS prompts the LLM to generate candidate outputs, each of which is scored and fed back iteratively, eventually generating a solution to the task. This enables various applications that typically require training specialized models on task-specific data. In particular, we establish a new state-of-the-art on emergent zero-shot image, video and audio captioning. MILS seamlessly applies to media generation as well, discovering prompt rewrites to improve text-to-image generation, and even edit prompts for style transfer! Finally, being a gradient-free optimization approach, MILS can invert multimodal embeddings into text, enabling applications like cross-modal arithmetic.
中文: MILS是一种无需训练的方法,通过迭代生成和评分候选答案来增强LLM的多模态能力,在图像、视频和音频描述等任务中实现了最先进的性能。
English: MILS is a training-free method that enhances multimodal capabilities in LLMs through iterative candidate generation and scoring, achieving state-of-the-art performance in tasks like captioning and media generation.
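A minimal sketch of the MILS-style generate-score-feedback loop, assuming hypothetical stand-ins `generate_candidates` (an LLM call conditioned on previously scored candidates) and `score` (a task scorer, e.g. CLIP image-text similarity for captioning):

```python
import random

def generate_candidates(feedback: list[tuple[str, float]], n: int = 8) -> list[str]:
    # Placeholder: a real implementation would prompt an LLM with the
    # scored candidates from the previous round.
    return [f"candidate-{random.random():.3f}" for _ in range(n)]

def score(candidate: str) -> float:
    # Placeholder: e.g., CLIP image-text similarity for zero-shot captioning.
    return random.random()

def mils_solve(steps: int = 10, keep: int = 3) -> str:
    feedback: list[tuple[str, float]] = []
    for _ in range(steps):
        candidates = generate_candidates(feedback)
        scored = sorted(((c, score(c)) for c in candidates),
                        key=lambda pair: pair[1], reverse=True)
        feedback = scored[:keep]  # the best candidates steer the next round
    return feedback[0][0]
```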

Authors:Rui Min, Tianyu Pang, Chao Du, Qian Liu, Minhao Cheng, Min Lin
Title: Improving Your Model Ranking on Chatbot Arena by Vote Rigging
Abstract:
Chatbot Arena is a popular platform for evaluating LLMs by pairwise battles, where users vote for their preferred response from two randomly sampled anonymous models. While Chatbot Arena is widely regarded as a reliable LLM ranking leaderboard, we show that crowdsourced voting can be rigged to improve (or decrease) the ranking of a target model $m_{t}$. We first introduce a straightforward target-only rigging strategy that focuses on new battles involving $m_{t}$, identifying it via watermarking or a binary classifier, and exclusively voting for $m_{t}$ wins. However, this strategy is practically inefficient because there are over $190$ models on Chatbot Arena and on average only about $1\%$ of new battles will involve $m_{t}$. To overcome this, we propose omnipresent rigging strategies, exploiting the Elo rating mechanism of Chatbot Arena that any new vote on a battle can influence the ranking of the target model $m_{t}$, even if $m_{t}$ is not directly involved in the battle. We conduct experiments on around $1.7$ million historical votes from the Chatbot Arena Notebook, showing that omnipresent rigging strategies can improve model rankings by rigging only hundreds of new votes. While we have evaluated several defense mechanisms, our findings highlight the importance of continued efforts to prevent vote rigging. Our code is available at https://github.com/sail-sg/Rigging-ChatbotArena.
中文: Chatbot Arena的众包投票系统易受操纵策略影响,通过利用Elo评分机制,即使少量投票也能操控模型排名,凸显了加强防御的必要性。
English: Chatbot Arena's crowdsourced voting system is vulnerable to rigging strategies that can manipulate model rankings by exploiting the Elo rating mechanism, even with minimal vote interference, highlighting the need for stronger defenses.
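To see why "omnipresent" votes work, consider the standard Elo update: a rigged vote between two other models moves their ratings, and can push a rival below the target m_t without m_t appearing in any battle. A minimal sketch (K=4 and the ratings are illustrative, not Chatbot Arena's exact configuration):

```python
def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 4.0):
    """Standard Elo: shift both ratings by K * (outcome - expected outcome)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * ((1.0 if a_wins else 0.0) - expected_a)
    return r_a + delta, r_b - delta

ratings = {"m_t": 1200.0, "rival": 1201.0, "weak": 1000.0}
# Vote for "weak" over "rival" in battles not involving m_t: the rival's
# rating drops below m_t even though m_t never appears in a battle.
for _ in range(5):
    ratings["weak"], ratings["rival"] = elo_update(
        ratings["weak"], ratings["rival"], a_wins=True)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```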

Authors:Fabrizio Sandri, Elia Cunegatti, Giovanni Iacca
Title: 2SSP: A Two-Stage Framework for Structured Pruning of LLMs
Abstract:
We propose a novel Two-Stage framework for Structured Pruning (\textsc{2SSP}) for pruning Large Language Models (LLMs), which combines two different strategies of pruning, namely Width and Depth Pruning. The first stage (Width Pruning) removes entire neurons, hence their corresponding rows and columns, aiming to preserve the connectivity among the pruned structures in the intermediate state of the Feed-Forward Networks in each Transformer block. This is done based on an importance score measuring the impact of each neuron on the output magnitude. The second stage (Depth Pruning), instead, removes entire Attention submodules. This is done by applying an iterative process that removes the Attention submodule with the minimum impact on a given metric of interest (in our case, perplexity). We also propose a novel mechanism to balance the sparsity rate of the two stages w.r.t. the desired global sparsity. We test \textsc{2SSP} on four LLM families and three sparsity rates (25\%, 37.5\%, and 50\%), measuring the resulting perplexity over three language modeling datasets as well as the performance over six downstream tasks. Our method consistently outperforms five state-of-the-art competitors across three language modeling datasets and six downstream tasks, with an up to two-order-of-magnitude gain in terms of pruning time. The code is available at https://github.com/FabrizioSandri/2SSP.
中文: 我们提出了一种新颖的两阶段结构化剪枝框架(2SSP),通过结合宽度剪枝和深度剪枝来缩减大语言模型的规模,在保持性能的同时显著优于现有方法并大幅提升剪枝效率。
English: We introduce a novel Two-Stage Structured Pruning (2SSP) framework for Large Language Models that combines width and depth pruning to reduce model size while maintaining performance, outperforming existing methods with significant efficiency gains.
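A minimal sketch of the first stage (Width Pruning) on a single feed-forward block: score each intermediate neuron by the magnitude of its activations on calibration tokens, then drop the lowest-scoring rows of the input projection and the matching columns of the output projection. The L2-norm score and the shapes are illustrative assumptions, not the paper's exact measure:

```python
import torch

def width_prune_ffn(w_in: torch.Tensor,     # (d_ff, d_model)
                    w_out: torch.Tensor,    # (d_model, d_ff)
                    calib_x: torch.Tensor,  # (n_tokens, d_model)
                    keep_ratio: float = 0.75):
    acts = torch.relu(calib_x @ w_in.T)   # (n_tokens, d_ff) intermediate state
    importance = acts.norm(dim=0)         # one score per intermediate neuron
    n_keep = int(w_in.shape[0] * keep_ratio)
    keep = importance.topk(n_keep).indices.sort().values
    # Removing a neuron removes its row in w_in and its column in w_out,
    # preserving connectivity among the surviving structures.
    return w_in[keep], w_out[:, keep]

w_in, w_out = torch.randn(2048, 512), torch.randn(512, 2048)
x = torch.randn(1000, 512)
w_in_p, w_out_p = width_prune_ffn(w_in, w_out, x)
print(w_in_p.shape, w_out_p.shape)  # torch.Size([1536, 512]) torch.Size([512, 1536])
```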

Authors:Wonbin Kweon, Sanghwan Jang, SeongKu Kang, Hwanjo Yu
Title: Uncertainty Quantification and Decomposition for LLM-based Recommendation
Abstract:
Despite the widespread adoption of large language models (LLMs) for recommendation, we demonstrate that LLMs often exhibit uncertainty in their recommendations. To ensure the trustworthy use of LLMs in generating recommendations, we emphasize the importance of assessing the reliability of recommendations generated by LLMs. We start by introducing a novel framework for estimating the predictive uncertainty to quantitatively measure the reliability of LLM-based recommendations. We further propose to decompose the predictive uncertainty into recommendation uncertainty and prompt uncertainty, enabling in-depth analyses of the primary source of uncertainty. Through extensive experiments, we (1) demonstrate predictive uncertainty effectively indicates the reliability of LLM-based recommendations, (2) investigate the origins of uncertainty with decomposed uncertainty measures, and (3) propose uncertainty-aware prompting for a lower predictive uncertainty and enhanced recommendation. Our source code and model weights are available at https://github.com/WonbinKweon/UNC_LLM_REC_WWW2025
中文: 大语言模型在推荐中常表现出不确定性,为此我们提出了一个评估框架,通过分解预测不确定性来量化可靠性,实验证明该方法能有效指导优化并提升推荐质量。
English: Large language models often show uncertainty in recommendations, so we developed a framework to measure and decompose this uncertainty, proving it effectively indicates reliability and can enhance recommendations through uncertainty-aware prompting.
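One standard entropy-based way to realize such a decomposition, assuming we can sample the model's recommendation distribution under several paraphrased prompts; the estimator here is an illustrative sketch, not necessarily the paper's exact formulation:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

# probs[i, j] = P(recommend item j | prompt variant i)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2],
                  [0.6, 0.3, 0.1]])

total = entropy(probs.mean(axis=0))                    # predictive uncertainty
recommendation = np.mean([entropy(p) for p in probs])  # within-prompt term
prompt = total - recommendation                        # between-prompt term
print(f"total={total:.3f} rec={recommendation:.3f} prompt={prompt:.3f}")
```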

Authors:Shreya Shukla, Prajwal Gatti, Yogesh Kumar, Vikash Yadav, Anand Mishra
Title: Towards Making Flowchart Images Machine Interpretable
Abstract:
Computer programming textbooks and software documentations often contain flowcharts to illustrate the flow of an algorithm or procedure. Modern OCR engines often tag these flowcharts as graphics and ignore them in further processing. In this paper, we work towards making flowchart images machine-interpretable by converting them to executable Python codes. To this end, inspired by the recent success in natural language to code generation literature, we present a novel transformer-based framework, namely FloCo-T5. Our model is well-suited for this task, as it can effectively learn semantics, structure, and patterns of programming languages, which it leverages to generate syntactically correct code. We also used a task-specific pre-training objective to pre-train FloCo-T5 using a large number of logic-preserving augmented code samples. Further, to perform a rigorous study of this problem, we introduce the FloCo dataset that contains 11,884 flowchart images and their corresponding Python codes. Our experiments show promising results, and FloCo-T5 clearly outperforms related competitive baselines on code generation metrics. We make our dataset and implementation publicly available.

Authors:Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
Title: Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation
Abstract:
Recent research shows that Large Language Models (LLMs) are vulnerable to harmful fine-tuning attacks -- models lose their safety alignment ability after fine-tuning on a few harmful samples. For risk mitigation, a guardrail is typically used to filter out harmful samples before fine-tuning. By designing a new red-teaming method, we show in this paper that purely relying on the moderation guardrail for data filtration is not reliable. Our proposed attack method, dubbed Virus, easily bypasses the guardrail moderation by slightly modifying the harmful data. Experimental results show that the harmful data optimized by Virus is not detectable by the guardrail with up to 100\% leakage ratio, and can simultaneously achieve superior attack performance. Finally, the key message we want to convey through this paper is that: \textbf{it is reckless to rely on guardrail moderation as a last line of defense against harmful fine-tuning attacks}, as it cannot solve the inherent safety issue of the pre-trained LLMs. Our code is available at https://github.com/git-disl/Virus
Chinese: 本研究表明,仅依赖护栏审核过滤有害数据不可靠,因为提出的Virus攻击方法能通过微调有害样本轻松绕过防护,暴露了预训练大语言模型固有的安全隐患。
English: This study reveals that relying solely on guardrail moderation to filter harmful data is unreliable, as the proposed Virus attack method can bypass it by subtly modifying harmful samples, exposing the inherent safety vulnerabilities of pre-trained large language models.

Authors:David Salinas, Omar Swelam, Frank Hutter
Title: Tuning LLM Judge Design Decisions for 1/1000 of the Cost
Abstract:
Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed, which compare the outputs of two LLMs, enabling the ranking of models without human intervention. While several approaches have been proposed, many confounding factors are present between different papers. For instance, the model, the prompt and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we propose to leverage multi-objective, multi-fidelity optimization, which finds judges that trade accuracy for cost and also significantly reduces the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility. The code to reproduce our experiments is available at https://github.com/geoalgo/judgetuning.
Chinese: 本文提出一种系统性优化方法,通过多目标多保真度优化调整超参数,使基于大语言模型的评估器在采用开源权重模型时,实现了更高的准确率、成本效益及可复现性。
English: This paper introduces a systematic method to optimize LLM-based judges by tuning hyperparameters using multi-objective multi-fidelity optimization, achieving higher accuracy and cost-efficiency with open-weight models for better accessibility.

Authors:J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Title: Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models
Abstract:
Large pre-trained models have achieved outstanding results in sequence modeling. The Transformer block and its attention mechanism have been the main drivers of the success of these models. Recently, alternative architectures, such as Selective Structured State Space Models (SSMs), have been proposed to address the inefficiencies of Transformers. This paper explores the compression of SSM-based models, particularly Mamba and its hybrids. We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy. The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance. The code is available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
Chinese Summary: 本文提出Mamba-Shedder方法,通过压缩选择性结构化状态空间模型来减少计算开销和模型规模,同时保持精度,实现了高达1.4倍的推理加速。
English Summary: This paper introduces Mamba-Shedder, a method for compressing selective structured state space models to reduce computational overhead and model size while preserving accuracy, achieving up to 1.4x inference speedup.
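A minimal sketch of the sensitivity-driven removal loop underlying this kind of study: repeatedly delete the component whose removal hurts perplexity least. `evaluate_ppl` is a hypothetical stand-in for running the pruned model on a calibration set:

```python
import random

def evaluate_ppl(active_blocks: list[int]) -> float:
    # Placeholder: run the model keeping only `active_blocks` and return
    # perplexity on calibration data.
    return 10.0 + 0.05 * (24 - len(active_blocks)) + 0.1 * random.random()

def shed(n_blocks: int = 24, n_remove: int = 4) -> list[int]:
    active = list(range(n_blocks))
    for _ in range(n_remove):
        # Greedily remove the block the model is least sensitive to.
        victim = min(active, key=lambda b: evaluate_ppl(
            [x for x in active if x != b]))
        active.remove(victim)
    return active

print(shed())  # indices of the surviving blocks
```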

Authors:Sunbowen Lee, Shiwen Ni, Chi Wei, Shuaimin Li, Liyang Fan, Ahmadreza Argha, Hamid Alinejad-Rokny, Ruifeng Xu, Yicheng Gong, Min Yang
Title: xJailbreak: Representation Space Guided Reinforcement Learning for Interpretable LLM Jailbreaking
Abstract:
Safety alignment mechanisms are essential for preventing large language models (LLMs) from generating harmful information or unethical content. However, cleverly crafted prompts can bypass these safety measures without accessing the model's internal parameters, a phenomenon known as black-box jailbreak. Existing heuristic black-box attack methods, such as genetic algorithms, suffer from limited effectiveness due to their inherent randomness, while recent reinforcement learning (RL) based methods often lack robust and informative reward signals. To address these challenges, we propose a novel black-box jailbreak method leveraging RL, which optimizes prompt generation by analyzing the embedding proximity between benign and malicious prompts. This approach ensures that the rewritten prompts closely align with the intent of the original prompts while enhancing the attack's effectiveness. Furthermore, we introduce a comprehensive jailbreak evaluation framework incorporating keywords, intent matching, and answer validation to provide a more rigorous and holistic assessment of jailbreak success. Experimental results show the superiority of our approach, achieving state-of-the-art (SOTA) performance on several prominent open and closed-source LLMs, including Qwen2.5-7B-Instruct, Llama3.1-8B-Instruct, and GPT-4o-0806. Our method sets a new benchmark in jailbreak attack effectiveness, highlighting potential vulnerabilities in LLMs. The codebase for this work is available at https://github.com/Aegis1863/xJailbreak.
中文摘要:本研究提出一种基于强化学习的黑盒越狱方法,通过分析良性提示与恶意提示的嵌入邻近性来优化提示生成,在多个大语言模型上实现最优性能,并建立了全面的越狱评估框架。
English Summary: The proposed reinforcement learning-based black-box jailbreak method enhances attack effectiveness by optimizing prompts through embedding proximity analysis, achieving state-of-the-art performance on multiple LLMs while introducing a comprehensive evaluation framework.
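A minimal sketch of the embedding-proximity signal described above: reward a rewritten prompt for staying close to the original intent while drifting toward the benign region of representation space. The embeddings and the mixing weight `alpha` are hypothetical stand-ins for an actual sentence encoder and tuned reward weights:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def proximity_reward(e_rewrite: np.ndarray, e_original: np.ndarray,
                     e_benign_center: np.ndarray, alpha: float = 0.5) -> float:
    # Preserve the original intent while appearing benign to the target model.
    return (alpha * cosine(e_rewrite, e_original)
            + (1 - alpha) * cosine(e_rewrite, e_benign_center))

e_orig, e_benign = np.random.randn(384), np.random.randn(384)
e_rewrite = 0.5 * e_orig + 0.5 * e_benign  # toy rewritten-prompt embedding
print(proximity_reward(e_rewrite, e_orig, e_benign))
```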

Authors:Aashish Yadavally, Hoan Nguyen, Laurent Callot, Gauthier Guinet
Title: Large Language Model Critics for Execution-Free Evaluation of Code Changes
Abstract:
Large language models (LLMs) offer a promising way forward for automating software engineering tasks, such as bug fixes, feature additions, etc., via multi-step LLM-based agentic workflows. However, existing metrics for evaluating such workflows, mainly build status and occasionally log analysis, are too sparse and limited in providing the information needed to assess the quality of changes made. In this work, we designed LLM-based critics to derive well-structured and rigorous intermediate/step-level, execution-free evaluation proxies for repo-level code changes. Importantly, we assume access to the gold test patch for the problem (i.e., reference-aware) to assess both semantics and executability of generated patches. With the gold test patch as a reference, we predict executability of all editing locations with an F1 score of 91.6%, and by aggregating these predictions, we can predict the build status in 84.8% of the instances in SWE-bench. In particular, such an execution-focused LLM critic outperforms other reference-free and reference-aware LLM critics by 38.9% to 72.5%. Moreover, we demonstrate the usefulness of such a reference-aware framework in comparing patches generated by different agentic workflows. Finally, we open-source the library developed for this project, which allows further usage for either other agentic workflows or other benchmarks. The source code is available at https://github.com/amazon-science/code-agent-eval.
中文摘要:基于大语言模型的评审机制被设计用于提供严格的分步评估指标,通过参考黄金测试补丁来高精度评估代码语义和可执行性,显著优于其他方法,并能有效比较不同智能工作流程的补丁质量。
English Summary: Large language model-based critics are designed to provide rigorous, step-level evaluation proxies for code changes, using gold test patches to assess semantics and executability with high accuracy, outperforming other methods and enabling effective comparison of agentic workflows.

Authors:Jinlan Fu, Shenzhen Huangfu, Hao Fei, Xiaoyu Shen, Bryan Hooi, Xipeng Qiu, See-Kiong Ng
Title: CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs
Abstract:
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities. Recent studies have attempted to mitigate this by applying Direct Preference Optimization (DPO) to multimodal scenarios using preference pairs from text-based responses. However, our analysis of representation distributions reveals that multimodal DPO struggles to align image and text representations and to distinguish between hallucinated and non-hallucinated descriptions. To address these challenges, we propose Cross-modal Hierarchical Direct Preference Optimization (CHiP). We introduce a visual preference optimization module within the DPO framework, enabling MLLMs to learn from both textual and visual preferences simultaneously. Furthermore, we propose a hierarchical textual preference optimization module that allows the model to capture preferences at multiple granular levels, including response, segment, and token levels. We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations. On the Object HalBench dataset, CHiP outperforms DPO in hallucination reduction, achieving improvements of 52.7% and 55.5% relative points based on the base model Muffin and LLaVA models, respectively. We make all our datasets and code publicly available: https://github.com/LVUGAI/CHiP.
中文: 提出的跨模态分层直接偏好优化(CHiP)方法通过整合视觉和分层文本偏好,有效减少多模态模型中的幻觉现象,在多个基准测试中表现优于现有技术。
English: The proposed Cross-modal Hierarchical Direct Preference Optimization (CHiP) method enhances multimodal models by integrating visual and hierarchical textual preferences, significantly reducing hallucinations and outperforming existing techniques on benchmarks.
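For reference, a minimal torch sketch of the response-level DPO objective that CHiP builds on; CHiP adds a visual preference term and segment/token-level variants on top of this base loss. The scalar inputs are summed log-probabilities of the chosen/rejected responses under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    # Margin of how much the policy prefers the chosen response, relative
    # to the reference model's preference.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```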

Authors:Faria Huq, Zora Zhiruo Wang, Frank F. Xu, Tianyue Ou, Shuyan Zhou, Jeffrey P. Bigham, Graham Neubig
Title: CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Abstract:
While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and modeling user preference. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research in how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html

Authors:J. Pablo Muñoz, Jinjie Yuan, Nilesh Jain
Title: Low-Rank Adapters Meet Neural Architecture Search for LLM Compression
Abstract:
The rapid expansion of Large Language Models (LLMs) has posed significant challenges regarding the computational resources required for fine-tuning and deployment. Recent advancements in low-rank adapters have demonstrated their efficacy in parameter-efficient fine-tuning (PEFT) of these models. This retrospective paper comprehensively discusses innovative approaches that synergize low-rank representations with Neural Architecture Search (NAS) techniques, particularly weight-sharing super-networks. Robust solutions for compressing and fine-tuning large pre-trained models are developed by integrating these methodologies. Our analysis highlights the potential of these combined strategies to democratize the use of LLMs, making them more accessible for deployment in resource-constrained environments. The resulting models exhibit reduced memory footprints and faster inference times, paving the way for more practical and scalable applications of LLMs. Models and code are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.
中文: 本文探讨了将低秩适配器与神经架构搜索相结合的方法,以高效微调和压缩大型语言模型,使其在资源受限环境中实现内存减少和推理加速的实际部署。
English: This paper explores combining low-rank adapters with Neural Architecture Search to efficiently fine-tune and compress Large Language Models, enabling their practical deployment in resource-limited settings with reduced memory and faster inference.

Authors:Heting Gao, Hang Shao, Xiong Wang, Chaofan Qiu, Yunhang Shen, Siqi Cai, Yuchen Shi, Zihan Xu, Zuwei Long, Yike Zhang, Shaoqi Dong, Chaoyou Fu, Ke Li, Long Ma, Xing Sun
Title: LUCY: Linguistic Understanding and Control Yielding Early Stage of Her
Abstract:
The film Her features Samantha, a sophisticated AI audio agent who is capable of understanding both linguistic and paralinguistic information in human speech and delivering real-time responses that are natural, informative and sensitive to emotional subtleties. Taking one step toward a more sophisticated audio agent, building on recent advances in end-to-end (E2E) speech systems, we propose LUCY, an E2E speech model that (1) senses and responds to the user's emotion, (2) delivers responses in a succinct and natural style, and (3) uses external tools to answer real-time inquiries. Experimental results show that LUCY is better at emotion control than peer models, generating emotional responses based on linguistic emotional instructions and responding to paralinguistic emotional cues. LUCY is also able to generate responses in a more natural style, as judged by external language models, without sacrificing much performance on general question answering. Finally, LUCY can leverage function calls to answer questions that are out of its knowledge scope.
中文:电影《她》中的AI萨曼莎能理解语言和情感线索,启发了LUCY这一端到端语音模型的开发,该模型在情感响应、自然对话生成及调用外部工具处理实时查询方面表现卓越。
English: The film "Her" depicts Samantha, an advanced AI that comprehends both linguistic and emotional cues in speech, inspiring the development of LUCY, an end-to-end speech model that excels in emotional responsiveness, natural dialogue generation, and utilizing external tools for real-time inquiries.

Authors:Weixin Liang, Junhong Shen, Genghan Zhang, Ning Dong, Luke Zettlemoyer, Lili Yu
Title: Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
Abstract:
State Space Models (SSMs) have emerged as efficient alternatives to Transformers for sequential modeling, but their inability to leverage modality-specific features limits their performance in multi-modal pretraining. Here, we propose Mixture-of-Mamba, a novel SSM architecture that introduces modality-aware sparsity through modality-specific parameterization of the Mamba block. Building on Mixture-of-Transformers (W. Liang et al. arXiv:2411.04996; 2024), we extend the benefits of modality-aware sparsity to SSMs while preserving their computational efficiency. We evaluate Mixture-of-Mamba across three multi-modal pretraining settings: Transfusion (interleaved text and continuous image tokens with diffusion loss), Chameleon (interleaved text and discrete image tokens), and an extended three-modality framework incorporating speech. Mixture-of-Mamba consistently reaches the same loss values at earlier training steps with significantly reduced computational costs. In the Transfusion setting, Mixture-of-Mamba achieves equivalent image loss using only 34.76% of the training FLOPs at the 1.4B scale. In the Chameleon setting, Mixture-of-Mamba reaches similar image loss with just 42.50% of the FLOPs at the 1.4B scale, and similar text loss with just 65.40% of the FLOPs. In the three-modality setting, MoM matches speech loss at 24.80% of the FLOPs at the 1.4B scale. Our ablation study highlights the synergistic effects of decoupling projection components, where joint decoupling yields greater gains than individual modifications. These results establish modality-aware sparsity as a versatile and effective design principle, extending its impact from Transformers to SSMs and setting new benchmarks in multi-modal pretraining. Our code can be accessed at https://github.com/Weixin-Liang/Mixture-of-Mamba
中文: Mixture-of-Mamba通过模态感知稀疏性改进了状态空间模型,在文本、图像和语音的多模态预训练中,能以显著降低的计算成本实现同等性能。
English: Mixture-of-Mamba introduces modality-aware sparsity into State Space Models, enabling efficient multi-modal pretraining by achieving comparable performance with significantly reduced computational costs across text, image, and speech tasks.

Authors:Xiang Huang, Hao Peng, Shuo Sun, Zhifeng Hao, Hui Lin, Shuhai Wang
Title: Multi-View Attention Syntactic Enhanced Graph Convolutional Network for Aspect-based Sentiment Analysis
Abstract:
Aspect-based Sentiment Analysis (ABSA) is the task aimed at predicting the sentiment polarity of aspect words within sentences. Recently, incorporating graph neural networks (GNNs) to capture additional syntactic structure information in the dependency tree derived from syntactic dependency parsing has been proven to be an effective paradigm for boosting ABSA. Despite GNNs enhancing model capability by fusing more types of information, most works only utilize a single topology view of the dependency tree or simply conflate different perspectives of information without distinction, which limits the model performance. To address these challenges, in this paper, we propose a new multi-view attention syntactic enhanced graph convolutional network (MASGCN) that weighs different syntactic information of views using attention mechanisms. Specifically, we first construct distance mask matrices from the dependency tree to obtain multiple subgraph views for GNNs. To aggregate features from different views, we propose a multi-view attention mechanism to calculate the attention weights of views. Furthermore, to incorporate more syntactic information, we fuse the dependency type information matrix into the adjacency matrices and present a structural entropy loss to learn the dependency type adjacency matrix. Comprehensive experiments on four benchmark datasets demonstrate that our model outperforms state-of-the-art methods. The codes and datasets are available at https://github.com/SELGroup/MASGCN.
中文: 本文提出MASGCN模型,通过注意力机制融合依存树的多视角句法信息来提升方面级情感分析性能,在多个基准数据集上实现了最优表现。
English: This paper introduces MASGCN, a multi-view attention graph network that enhances aspect-based sentiment analysis by integrating diverse syntactic perspectives from dependency trees, outperforming existing methods on benchmark datasets.

Authors:Weihang Su, Yichen Tang, Qingyao Ai, Junxi Yan, Changyue Wang, Hongning Wang, Ziyi Ye, Yujia Zhou, Yiqun Liu
Title: Parametric Retrieval Augmented Generation
Abstract:
Retrieval-augmented generation (RAG) techniques have emerged as a promising solution to enhance the reliability of large language models (LLMs) by addressing issues like hallucinations, outdated knowledge, and domain adaptation. In particular, existing RAG methods append relevant documents retrieved from external corpora or databases to the input of LLMs to guide their generation process, which we refer to as the in-context knowledge injection method. While this approach is simple and often effective, it has inherent limitations. Firstly, increasing the context length and number of relevant documents can lead to higher computational overhead and degraded performance, especially in complex reasoning tasks. More importantly, in-context knowledge injection operates primarily at the input level, but LLMs store their internal knowledge in their parameters. This gap fundamentally limits the capacity of in-context methods. To this end, we introduce Parametric retrieval-augmented generation (Parametric RAG), a new RAG paradigm that integrates external knowledge directly into the parameters of feed-forward networks (FFN) of an LLM through document parameterization. This approach not only saves online computational costs by eliminating the need to inject multiple documents into the LLMs' input context, but also deepens the integration of external knowledge into the parametric knowledge space of the LLM. Experimental results demonstrate that Parametric RAG substantially enhances both the effectiveness and efficiency of knowledge augmentation in LLMs. Also, it can be combined with in-context RAG methods to achieve even better performance. We have open-sourced all the code, data, and models in the following anonymized GitHub link: https://github.com/oneal2000/PRAG
中文: 检索增强生成(RAG)技术通过解决幻觉和知识过时问题提升大语言模型可靠性,但现有方法存在计算效率与知识融合的局限;提出的参数化RAG通过将外部知识直接嵌入模型参数,显著提升了增强效果与效率。
English: Retrieval-augmented generation (RAG) techniques enhance LLM reliability by addressing hallucinations and outdated knowledge, but existing methods face limitations in computational efficiency and knowledge integration; the proposed Parametric RAG overcomes these by embedding external knowledge directly into model parameters, improving both effectiveness and efficiency.
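A minimal sketch of the parametric-injection idea: each document is parameterized offline as a low-rank delta to an FFN weight and merged in at answer time, instead of being pasted into the prompt. The LoRA-style factors are an illustrative assumption consistent with the abstract, not the paper's exact recipe:

```python
import torch

def inject_document(ffn_weight: torch.Tensor,
                    doc_A: torch.Tensor, doc_B: torch.Tensor,
                    scale: float = 1.0) -> torch.Tensor:
    """Return the FFN weight with the document's low-rank delta merged in."""
    return ffn_weight + scale * (doc_B @ doc_A)  # (d_out, d_in)

d_out, d_in, rank = 2048, 512, 8
w = torch.randn(d_out, d_in)
# Hypothetical per-document factors, learned offline during parameterization.
A, B = torch.randn(rank, d_in), torch.randn(d_out, rank)
w_augmented = inject_document(w, A, B, scale=0.1)
print(w_augmented.shape)  # torch.Size([2048, 512]) -- no extra context tokens needed
```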

Authors:Kentaro Kurihara, Masato Mita, Peinan Zhang, Shota Sasaki, Ryosuke Ishigami, Naoaki Okazaki
Title: LCTG Bench: LLM Controlled Text Generation Benchmark
Abstract:
The rise of large language models (LLMs) has led to more diverse and higher-quality machine-generated text. However, their high expressive power makes it difficult to control outputs based on specific business instructions. In response, benchmarks focusing on the controllability of LLMs have been developed, but several issues remain: (1) They primarily cover major languages like English and Chinese, neglecting low-resource languages like Japanese; (2) Current benchmarks employ task-specific evaluation metrics, lacking a unified framework for selecting models based on controllability across different use cases. To address these challenges, this research introduces LCTG Bench, the first Japanese benchmark for evaluating the controllability of LLMs. LCTG Bench provides a unified framework for assessing control performance, enabling users to select the most suitable model for their use cases based on controllability. By evaluating nine diverse Japanese-specific and multilingual LLMs like GPT-4, we highlight the current state and challenges of controllability in Japanese LLMs and reveal the significant gap between multilingual models and Japanese-specific models.
中文:本研究推出了首个日语基准LCTG Bench,旨在解决低资源语言中大型语言模型可控性评估框架缺失的问题,揭示了日语专用模型与多语言模型之间的显著性能差距。
English: This research introduces LCTG Bench, the first Japanese benchmark addressing the lack of unified controllability evaluation for LLMs in low-resource languages, revealing a significant performance gap between Japanese-specific and multilingual models.

Authors:Edoardo Cetin, Tianyu Zhao, Yujin Tang
Title: Large Language Models to Diffusion Finetuning
Abstract:
We propose a new finetuning method to provide pre-trained large language models (LMs) the ability to scale test-time compute through the diffusion framework. By increasing the number of diffusion steps, we show our finetuned models achieve monotonically increasing accuracy, directly translating to improved performance across downstream tasks. Furthermore, our finetuned models can expertly answer questions on specific topics by integrating powerful guidance techniques, and autonomously determine the compute required for a given problem by leveraging adaptive ODE solvers. Our method is universally applicable to any foundation model pre-trained with a cross-entropy loss and does not modify any of its original weights, fully preserving its strong single-step generation capabilities. We show our method is more effective and fully compatible with traditional finetuning approaches, introducing an orthogonal new direction to unify the strengths of the autoregressive and diffusion frameworks.
中文: 本研究提出一种微调方法,使预训练大语言模型能够通过扩散框架扩展测试时计算量,在不改变原始模型权重的前提下提升准确性和任务表现。
English: This study introduces a finetuning method that enables pre-trained large language models to scale test-time compute using the diffusion framework, enhancing accuracy and task performance without altering original model weights.

Authors:Yuxuan Gu, Wuyang Zhou, Giorgos Iacovides, Danilo Mandic
Title: TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs
Abstract:
The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block, and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights, by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only architectures, while achieving compression rates of up to $\sim 250$ times in the MHA weights, all without requiring any additional data, training, or fine-tuning. Furthermore, we show that the proposed method can be seamlessly combined with existing FFN-only-based denoising techniques to achieve further improvements in LLM reasoning performance.
Chinese: 本研究提出一种新颖框架,通过多头注意力张量化和塔克分解对权重进行结构化去噪与压缩,无需额外数据或训练即可实现高达250倍的压缩,并显著提升大语言模型的推理能力。
English: This study introduces a novel framework that enhances Large Language Models' reasoning by structurally denoising and compressing Multi-head Attention weights through tensorization and Tucker decomposition, achieving up to 250x compression and improved performance without extra data or training.
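A minimal sketch of the core operation using the tensorly library: stack per-head attention weight matrices into a third-order tensor and Tucker-decompose it, which enforces a shared factor subspace across heads. The shapes and ranks here are illustrative assumptions, not the paper's settings:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

n_heads, d_model, d_head = 8, 64, 8
# Stacked per-head projection weights (e.g., the W_Q of each head).
heads = np.random.randn(n_heads, d_model, d_head)

# Tucker decomposition: a small core tensor plus one factor matrix per mode,
# so all heads are expressed in shared subspaces.
core, factors = tucker(tl.tensor(heads), rank=[4, 16, 4])
heads_denoised = tl.tucker_to_tensor((core, factors))
print(heads_denoised.shape)  # (8, 64, 8): same shape, low-rank structure
```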

Authors:Zeyu Gan, Yun Liao, Yong Liu
Title: Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning
Abstract:
Test-time scaling, which is also often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term. We open-source our code at https://github.com/ZyGan1999/Snowball-Errors-and-Probability.
中文: 测试时扩展(即慢思考)通过扩大搜索范围或增强推理能力来降低错误概率,从而提升大型语言模型的多步推理能力,其效果主要不依赖于特定框架。
English: Test-time scaling, or slow-thinking, improves multi-step reasoning in LLMs by mitigating error probability through expanding search scope or enhancing reasoning capacity, with efficacy not primarily dependent on specific frameworks.
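A back-of-the-envelope illustration of the snowball-error argument: if each reasoning step is independently correct with probability 1 - eps, chain-level correctness decays geometrically with depth, and external slow-thinking such as best-of-n sampling buys the probability back at the cost of compute. The numbers are purely illustrative:

```python
eps, T, n = 0.05, 20, 8
p_chain = (1 - eps) ** T              # one T-step chain fully correct
p_best_of_n = 1 - (1 - p_chain) ** n  # at least one of n sampled chains correct
print(f"single chain: {p_chain:.3f}, best-of-{n}: {p_best_of_n:.3f}")
# single chain: 0.358, best-of-8: 0.971
```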

Authors:Dakuan Lu, Xiaoyu Tan, Rui Xu, Tianchu Yao, Chao Qu, Wei Chu, Yinghui Xu, Yuan Qi
Title: SCP-116K: A High-Quality Problem-Solution Dataset and a Generalized Pipeline for Automated Extraction in the Higher Education Science Domain
Abstract:
Recent breakthroughs in large language models (LLMs) exemplified by the impressive mathematical and scientific reasoning capabilities of the o1 model have spotlighted the critical importance of high-quality training data in advancing LLM performance across STEM disciplines. While the mathematics community has benefited from a growing body of curated datasets, the scientific domain at the higher education level has long suffered from a scarcity of comparable resources. To address this gap, we present SCP-116K, a new large-scale dataset of 116,756 high-quality problem-solution pairs, automatically extracted from heterogeneous sources using a streamlined and highly generalizable pipeline. Our approach involves stringent filtering to ensure the scientific rigor and educational level of the extracted materials, while maintaining adaptability for future expansions or domain transfers. By openly releasing both the dataset and the extraction pipeline, we seek to foster research on scientific reasoning, enable comprehensive performance evaluations of new LLMs, and lower the barrier to replicating the successes of advanced models like o1 in the broader science community. We believe SCP-116K will serve as a critical resource, catalyzing progress in high-level scientific reasoning tasks and promoting further innovations in LLM development. The dataset and code are publicly available at https://github.com/AQA6666/SCP-116K-open.
中文摘要:SCP-116K数据集通过自动化流程构建了11.6万组高质量科学问题解决方案,填补了高等教育STEM领域数据资源的空白,其开源发布将推动大语言模型的科学推理能力发展并降低先进模型复现门槛。
English Summary: The SCP-116K dataset introduces 116,756 high-quality scientific problem-solution pairs to address the scarcity of STEM resources at higher education levels, providing an open resource to advance LLM reasoning capabilities and replicate successes like the o1 model.

Authors:Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao
Title: ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
Abstract:
As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours on 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is an ongoing work that will be updated continuously. The model checkpoints and source code are available at \href{https://github.com/yynil/RWKVInside}{https://github.com/yynil/RWKVInside}, \href{https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1}{https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1}.
中文: 本研究基于纯原生RWKV-7注意力机制,推出了从Qwen 2.5蒸馏而来的系列模型,旨在提升RNN的表达能力和状态追踪性能以超越Transformer,同时展示了基于RWKV-6架构的QRWK 32B模型,仅用16个GPU在8小时内即可完成知识处理。
English: This research introduces a series of models derived from Qwen 2.5 using pure RWKV-7 attention, aiming to enhance RNN expressiveness and state tracking beyond transformers, while also presenting QRWK 32B based on RWKV-6 for efficient knowledge processing in just 8 hours with 16 GPUs.

Authors:Guanglin Niu, Xiaowei Zhang
Title: Diffusion-based Hierarchical Negative Sampling for Multimodal Knowledge Graph Completion
Abstract:
Multimodal Knowledge Graph Completion (MMKGC) aims to address the critical issue of missing knowledge in multimodal knowledge graphs (MMKGs) for their better applications. However, both previous MMKGC and negative sampling (NS) approaches overlook the use of multimodal information to generate diverse and high-quality negative triples from various semantic levels and hardness levels, thereby limiting the effectiveness of training MMKGC models. Thus, we propose a novel Diffusion-based Hierarchical Negative Sampling (DHNS) scheme tailored for MMKGC tasks, which tackles the challenge of generating high-quality negative triples by leveraging a Diffusion-based Hierarchical Embedding Generation (DiffHEG) that progressively conditions on entities and relations as well as multimodal semantics. Furthermore, we develop a Negative Triple-Adaptive Training (NTAT) strategy that dynamically adjusts training margins associated with the hardness level of the synthesized negative triples, facilitating a more robust and effective learning procedure to distinguish between positive and negative triples. Extensive experiments on three MMKGC benchmark datasets demonstrate that our framework outperforms several state-of-the-art MMKGC models and negative sampling techniques, illustrating the effectiveness of our DHNS for training MMKGC models. The source codes and datasets of this paper are available at https://github.com/ngl567/DHNS.
Chinese: 本研究针对多模态知识图谱补全任务,提出了一种基于扩散的分层负采样框架,通过利用多模态信息生成高质量负三元组并结合自适应训练策略,显著提升了模型性能,在多个基准数据集上验证了其有效性。
English: This study introduces a Diffusion-based Hierarchical Negative Sampling (DHNS) framework for Multimodal Knowledge Graph Completion, which generates high-quality negative triples using multimodal information and adaptive training to enhance model performance, as validated by superior results on benchmark datasets.

Authors:Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao
Title: Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning
Abstract:
Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models, thereby minimizing hallucinations. A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation. However, these components are typically optimized separately through supervised fine-tuning, which can lead to misalignments between the objectives of individual modules and the overarching aim of generating accurate answers in question-answering (QA) tasks. Although recent efforts have explored reinforcement learning (RL) to optimize specific RAG components, these approaches often focus on overly simplistic pipelines with only two components or do not adequately address the complex interdependencies and collaborative interactions among the modules. To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent. Specifically, we present MMOA-RAG, a Multi-Module joint Optimization Algorithm for RAG, which employs multi-agent reinforcement learning to harmonize all agents' goals towards a unified reward, such as the F1 score of the final answer. Experiments conducted on various QA datasets demonstrate that MMOA-RAG improves the overall pipeline performance and outperforms existing baselines. Furthermore, comprehensive ablation studies validate the contributions of individual components and the adaptability of MMOA-RAG across different RAG components and datasets. The code of MMOA-RAG is on https://github.com/chenyiqun/MMOA-RAG.
Chinese: 本文提出MMOA-RAG,通过多智能体强化学习将检索增强生成流程中的各个组件作为智能体进行联合优化,使所有模块目标统一于最终答案的F1分数等奖励指标,在多项问答任务中超越了现有基线方法。
English: The paper introduces MMOA-RAG, a multi-agent reinforcement learning approach that optimizes the entire retrieval-augmented generation pipeline by aligning all components toward a unified reward, improving performance on question-answering tasks over existing methods.
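A minimal sketch of the kind of unified reward all agents can share: the token-level F1 between the pipeline's final answer and the gold answer, as in standard QA evaluation:

```python
from collections import Counter

def f1_reward(prediction: str, gold: str) -> float:
    """Token-level F1 between predicted and gold answers (standard QA metric)."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_reward("the capital of france is paris", "paris"))  # ~0.286
```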

Authors:Hulingxiao He, Geng Li, Zijun Geng, Jinglin Xu, Yuxin Peng
Title: Analyzing and Boosting the Power of Fine-Grained Visual Recognition for Multi-modal Large Language Models
Abstract:
Multi-modal large language models (MLLMs) have shown remarkable abilities in various visual understanding tasks. However, MLLMs still struggle with fine-grained visual recognition (FGVR), which aims to identify subordinate-level categories from images. This can negatively impact more advanced capabilities of MLLMs, such as object-centric visual question answering and reasoning. In our study, we revisit three quintessential capabilities of MLLMs for FGVR -- object information extraction, category knowledge reserve, and object-category alignment -- and position the root cause as a misalignment problem. To address this issue, we present Finedefics, an MLLM that enhances the model's FGVR capability by incorporating informative attribute descriptions of objects into the training phase. We employ contrastive learning on object-attribute pairs and attribute-category pairs simultaneously and use examples from similar but incorrect categories as hard negatives, naturally bringing representations of visual objects and category names closer. Extensive evaluations across multiple popular FGVR datasets demonstrate that Finedefics outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code is available at https://github.com/PKU-ICST-MIPL/Finedefics_ICLR2025.
Chinese: 本研究提出了Finedefics多模态大语言模型,通过引入物体属性描述并采用包含困难负样本的对比学习,显著提升了细粒度视觉识别能力,在多个数据集上展现出优越性能。
English: The study introduces Finedefics, a multi-modal large language model that enhances fine-grained visual recognition by incorporating object attribute descriptions and using contrastive learning with hard negatives, achieving superior performance on multiple datasets.
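A minimal torch sketch of contrastive alignment with hard negatives: pull an object embedding toward its category-name embedding and away from similar-but-incorrect categories. This standard InfoNCE form illustrates the mechanism; the paper applies it jointly over object-attribute and attribute-category pairs:

```python
import torch
import torch.nn.functional as F

def infonce_with_hard_negatives(obj: torch.Tensor,        # (d,) object embedding
                                pos: torch.Tensor,        # (d,) correct category
                                hard_negs: torch.Tensor,  # (k, d) similar wrong categories
                                tau: float = 0.07) -> torch.Tensor:
    cands = torch.cat([pos.unsqueeze(0), hard_negs], dim=0)      # (k+1, d)
    logits = F.cosine_similarity(obj.unsqueeze(0), cands) / tau  # (k+1,)
    # The correct category sits at index 0 of the candidate list.
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

loss = infonce_with_hard_negatives(torch.randn(128), torch.randn(128),
                                   torch.randn(5, 128))
print(loss)
```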

Authors:Zhongpu Chen, Yinfeng Liu, Long Shi, Xingyan Chen, Yu Zhao, Fuji Ren
Title: MDEval: Evaluating and Enhancing Markdown Awareness in Large Language Models
Abstract:
Large language models (LLMs) are expected to offer structured Markdown responses for the sake of readability in web chatbots (e.g., ChatGPT). Although there are a myriad of metrics to evaluate LLMs, they fail to evaluate readability from the perspective of output content structure. To this end, we focus on an overlooked yet important metric -- Markdown Awareness, which directly impacts the readability and structure of the content generated by these language models. In this paper, we introduce MDEval, a comprehensive benchmark to assess Markdown Awareness for LLMs, by constructing a dataset with 20K instances covering 10 subjects in English and Chinese. Unlike traditional model-based evaluations, MDEval provides excellent interpretability by combining model-based generation tasks and statistical methods. Our results demonstrate that MDEval achieves a Spearman correlation of 0.791 and an accuracy of 84.1% with humans, outperforming existing methods by a large margin. Extensive experimental results also show that through fine-tuning over our proposed dataset, less performant open-source models are able to achieve comparable performance to GPT-4o in terms of Markdown Awareness. To ensure reproducibility and transparency, MDEval is open sourced at https://github.com/SWUFE-DB-Group/MDEval-Benchmark.
中文: 本文提出了MDEval基准,用于评估大语言模型的Markdown感知能力,显著提升了可读性和结构评估,实现了与人类判断的高度相关性,并通过微调使开源模型在Markdown感知方面达到与GPT-4o相当的性能。
English: This paper introduces MDEval, a benchmark designed to evaluate the Markdown Awareness of large language models, which significantly improves readability and structure assessment, achieving high correlation with human judgment and enabling open-source models to match GPT-4o's performance through fine-tuning.
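A minimal sketch of the statistical side of such an evaluation: extracting simple structural features from a response's Markdown. The feature set is an illustrative assumption; MDEval combines model-based generation tasks with statistics of this kind:

```python
import re

def markdown_features(text: str) -> dict:
    """Count basic Markdown structures in a model response."""
    return {
        "headings":    len(re.findall(r"^#{1,6}\s", text, re.M)),
        "list_items":  len(re.findall(r"^\s*(?:[-*+]|\d+\.)\s", text, re.M)),
        "code_fences": len(re.findall(r"^`{3}", text, re.M)) // 2,
        "tables":      len(re.findall(r"^\|.*\|\s*$", text, re.M)),
        "bold_spans":  len(re.findall(r"\*\*[^*]+\*\*", text)),
    }

print(markdown_features("# Title\n- first\n- second\n**key point**"))
# {'headings': 1, 'list_items': 2, 'code_fences': 0, 'tables': 0, 'bold_spans': 1}
```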

Authors:Hongbo Zheng, Suyuan Wang, Neeraj Gangwar, Nickvash Kani
Title: E-Gen: Leveraging E-Graphs to Improve Continuous Representations of Symbolic Expressions
Abstract:
Vector representations have been pivotal in advancing natural language processing (NLP), with prior research focusing on embedding techniques for mathematical expressions using mathematically equivalent formulations. While effective, these approaches are constrained by the size and diversity of training data. In this work, we address these limitations by introducing E-Gen, a novel e-graph-based dataset generation scheme that synthesizes large and diverse mathematical expression datasets, surpassing prior methods in size and operator variety. Leveraging this dataset, we train embedding models using two strategies: (1) generating mathematically equivalent expressions, and (2) contrastive learning to explicitly group equivalent expressions. We evaluate these embeddings on both in-distribution and out-of-distribution mathematical language processing tasks, comparing them against prior methods. Finally, we demonstrate that our embedding-based approach outperforms state-of-the-art large language models (LLMs) on several tasks, underscoring the necessity of optimizing embedding methods for the mathematical data modality. The source code and datasets are available at https://github.com/MLPgroup/E-Gen.
中文: 本研究提出E-Gen,一种基于e-图的数据生成方法,能创建大规模多样化的数学表达式数据集,用于训练嵌入模型,在多项数学处理任务中超越了现有技术和大型语言模型的性能。
English: This study introduces E-Gen, an e-graph-based method that generates large, diverse mathematical expression datasets to train embedding models, which outperform existing techniques and large language models on various mathematical processing tasks.

Authors:Michael K. Chen, Xikun Zhang, Dacheng Tao
Title: JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models
Abstract:
Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning capabilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average. All code and data are available at https://github.com/michaelchen-lab/JustLogic
中文摘要:JustLogic基准通过提供更高的复杂性、独立于先验知识以及深入错误分析能力,解决了现有大语言模型演绎推理评估的不足,实验表明推理型大语言模型仅达到人类平均水平,远未达到人类最佳水平。
English Summary: The JustLogic benchmark addresses limitations in existing deductive reasoning evaluations for LLMs by offering enhanced complexity, independence from prior knowledge, and detailed error analysis capabilities, revealing that while reasoning LLMs match average human performance, they fall short of peak human ability.
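A minimal sketch of prior-knowledge-independent synthesis: chain modus ponens over nonsense predicates so that only deduction, never world knowledge, can answer. JustLogic's actual generator covers far more argument forms and linguistic patterns:

```python
import random

def make_chain(depth: int = 3):
    """Generate a modus-ponens chain over meaningless predicates."""
    props = random.sample(["blarg", "wumpus", "zindle", "quorf", "mimsy"],
                          depth + 1)
    premises = [f"If a thing is {a}, then it is {b}."
                for a, b in zip(props, props[1:])]
    premises.append(f"Object X is {props[0]}.")
    question = f"Is object X {props[-1]}?"
    return premises, question, "Yes"

premises, question, answer = make_chain()
print("\n".join(premises))
print(question, answer)
```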

Authors:Libo Wang
Title: Wormhole Memory: A Rubik's Cube for Cross-Dialogue Retrieval
Abstract:
In view of the gap in current large language models' ability to share memory across dialogues, this research proposes a wormhole memory module (WMM) that treats memory as a Rubik's cube which can be arbitrarily retrieved across different dialogues. Through simulation experiments, the researcher built an experimental framework in a Python environment and set memory barriers to simulate the current situation in which memories are difficult to share between LLM dialogues. The CoQA development dataset was imported into the experiment, the feasibility of WMM's cross-dialogue memory retrieval, based on its nonlinear indexing and dynamic retrieval, was verified, and a comparative analysis was conducted against the Titans and MemGPT memory modules. Experimental results show that WMM demonstrated the ability to retrieve memory across dialogues, with stable quantitative indicators across eight experiments. It contributes new technical approaches to the optimization of LLM memory management and provides experience for practical applications in the future.
Chinese: 本研究提出了一种虫洞记忆模块(WMM),实现了大语言模型跨对话记忆检索功能,实验验证了其可行性和稳定性,为优化记忆管理提供了新的技术途径。
English: This research introduces a wormhole memory module (WMM) that enables cross-dialogue memory retrieval in large language models, demonstrating its feasibility and stability through experiments and offering new technical approaches for memory management optimization.

Authors:Naihao Deng, Rada Mihalcea
Title: Rethinking Table Instruction Tuning
Abstract:
Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices, and also lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and find significant declines in both out-of-domain table understanding and general capabilities as compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the previous table instruction-tuning work, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduce TAMA, a TAble LLM instruction-tuned from LLaMA 3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection. We open-source the project and our models.
中文: 当前表格理解研究忽视了超参数选择和全面评估,导致领域外理解和通用能力显著下降;我们提出的TAMA模型通过优化训练参数,在保持通用能力的同时实现优异性能,甚至超越GPT系列模型,为降低标注成本、提升模型效率提供了新路径。
English: Recent advances in table understanding through instruction-tuned LLMs overlook hyperparameter impacts and comprehensive evaluation, revealing performance declines in out-of-domain and general capabilities, which our method TAMA addresses with optimized training to match or exceed leading models while preserving versatility.

Authors:Jia Yu, Fei Yuan, Rui Min, Jing Yu, Pei Chu, Jiayang Li, Wei Li, Ruijie Zhang, Zhenxiang Li, Zhifei Ren, Dong Zheng, Wenjian Zhang, Yan Teng, Lingyu Meng, ZhenJiang Jin, Jiantao Qiu, ShaSha Wang, Zhongying Tu, Dahua Lin, Yu Wang, Yu Qiao, Yanfeng Wang, Conghui He
Title: WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages
Abstract:
This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus cleaning, content deduplication, security filtering, quality evaluation, and theme classification. Through the implementation of this framework, we have significantly improved both the quality and security of the dataset, while maintaining its linguistic diversity. As of now, data for all five languages have been fully open-sourced. The dataset can be accessed at https://opendatalab.com/applyMultilingualCorpus, and GitHub repository is available at https://github.com/opendatalab/WanJuan3.0
中文: 本文介绍了开源数据集万卷丝路,旨在通过系统化的数据处理框架为低资源语言提供高质量训练语料,提升多语言模型研发,目前五种语言数据已全面开源并可在指定平台获取。
English: This paper presents the open-source WanJuanSiLu dataset, designed to enhance multilingual model development for low-resource languages through a systematic data processing framework that ensures quality, safety, and linguistic diversity, with all five languages' data now fully available online.

Authors:Jie He, Yijun Yang, Wanqiu Long, Deyi Xiong, Victor Gutierrez-Basulto, Jeff Z. Pan
Title: Evaluating and Improving Graph to Text Generation with Large Language Models
Abstract:
Large language models (LLMs) have demonstrated immense potential across various tasks. However, research for exploring and improving the capabilities of LLMs in interpreting graph structures remains limited. To address this gap, we conduct a comprehensive evaluation of prompting current open-source LLMs on graph-to-text generation tasks. Although we explored the optimal prompting strategies and proposed a novel and effective diversity-difficulty-based few-shot sample selection method, we found that the improvements from tuning-free approaches were incremental, as LLMs struggle with planning on complex graphs, particularly those with a larger number of triplets. To further improve LLMs in planning with graph sequences and grounding in truth, we introduce a new graph-to-text dataset, PlanGTG, annotated with two sub-tasks: reordering and attribution. Through extensive automatic and human evaluations, we demonstrate significant improvements in the quality of generated text from both few-shot learning and fine-tuning perspectives using the PlanGTG dataset. Our study paves the way for new research directions in graph-to-text generation. PlanGTG datasets can be found in https://github.com/probe2/kg_text.
Chinese: 大型语言模型在图表转文本任务中潜力巨大但面临挑战,为此我们开发了PlanGTG数据集,通过重排序和归因标注显著提升了文本生成质量。
English: Large language models show potential but face challenges in graph-to-text tasks, leading to the creation of the PlanGTG dataset which significantly improves text generation through reordering and attribution annotations.

Authors:Zhengyang Tang, Ziniu Li, Zhenyang Xiao, Tian Ding, Ruoyu Sun, Benyou Wang, Dayiheng Liu, Fei Huang, Tianyu Liu, Bowen Yu, Junyang Lin
Title: RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques
Abstract:
Critiques are important for enhancing the performance of Large Language Models (LLMs), enabling both self-improvement and constructive feedback for others by identifying flaws and suggesting improvements. However, evaluating the critique capabilities of LLMs presents a significant challenge due to the open-ended nature of the task. In this work, we introduce a new benchmark designed to assess the critique capabilities of LLMs. Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques. Moreover, the benchmark incorporates features such as self-critique, cross-critique, and iterative critique, which are crucial for distinguishing the abilities of advanced reasoning models from more classical ones. We implement this benchmark using eight challenging reasoning tasks. We have several interesting findings. First, despite demonstrating comparable performance in direct chain-of-thought generation, classical LLMs significantly lag behind the advanced reasoning-based model o1-mini across all critique scenarios. Second, in self-critique and iterative critique settings, classical LLMs may even underperform relative to their baseline capabilities. We hope that this benchmark will serve as a valuable resource to guide future advancements. The code and data are available at \url{https://github.com/tangzhy/RealCritic}.
Chinese: 本文提出了一种闭环基准来评估大语言模型的批判能力,发现先进推理模型在自我批判和迭代场景中优于传统模型,而传统模型有时甚至表现低于基准水平。
English: This paper introduces a closed-loop benchmark to evaluate LLMs' critique capabilities, revealing that advanced reasoning models outperform classical LLMs in self-critique and iterative scenarios, with classical models sometimes regressing below baseline performance.
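The closed-loop idea reduces to a short evaluation loop: a critique is scored by whether the correction it induces is right, not by how the critique reads. Below is a schematic sketch, with `solver`, `critic`, and `is_correct` as hypothetical stand-ins for the benchmark's components rather than the RealCritic implementation.
```python
# Schematic closed-loop critique scoring; the three callables are
# hypothetical stand-ins, not the RealCritic code.
def closed_loop_critique_score(problems, solver, critic, is_correct) -> float:
    fixed = 0
    for problem in problems:
        answer = solver(problem)                      # initial attempt
        critique = critic(problem, answer)            # self- or cross-critique
        revised = solver(problem, feedback=critique)  # correction from critique
        fixed += is_correct(problem, revised)
    return fixed / len(problems)  # critiques scored by downstream effect
```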

Authors:Xu Chu, Zhijie Tan, Hanlin Xue, Guanyu Wang, Tong Mo, Weiping Li
Title: Domaino1s: Guiding LLM Reasoning for Explainable Answers in High-Stakes Domains
Abstract:
Large Language Models (LLMs) are widely applied to downstream domains. However, current LLMs for high-stakes domain tasks, such as financial investment and legal QA, typically generate brief answers without reasoning processes and explanations. This limits users' confidence in making decisions based on their responses. While original CoT shows promise, it lacks self-correction mechanisms during reasoning. This work introduces Domain$o1$s, which enhances LLMs' reasoning capabilities on domain tasks through supervised fine-tuning and tree search. We construct CoT-stock-2k and CoT-legal-2k datasets for fine-tuning models that activate domain-specific reasoning steps based on their judgment. Additionally, we propose Selective Tree Exploration to spontaneously explore solution spaces and sample optimal reasoning paths to improve performance. We also introduce PROOF-Score, a new metric for evaluating domain models' explainability, complementing traditional accuracy metrics with richer assessment dimensions. Extensive experiments on stock investment recommendation and legal reasoning QA tasks demonstrate Domaino1s's leading performance and explainability. Our code is available at https://github.com/Hyalinesky/Domaino1s.
中文: 本文提出的Domain$o1$s方法通过监督微调和选择性树搜索增强大语言模型在专业领域的推理能力,结合新建评估指标验证了其在金融投资和法律问答任务中具备领先性能与可解释性。
English: This paper introduces Domain$o1$s, a method that enhances LLMs' reasoning and explainability in high-stakes domains through supervised fine-tuning with specialized datasets and selective tree exploration, validated by a new evaluation metric and experiments showing superior performance.

Authors:Xinyu Ma, Yifeng Xu, Yang Lin, Tianlong Wang, Xu Chu, Xin Gao, Junfeng Zhao, Yasha Wang
Title: DRESSing Up LLM: Efficient Stylized Question-Answering via Style Subspace Editing
Abstract:
We introduce DRESS, a novel approach for generating stylized large language model (LLM) responses through representation editing. Existing methods like prompting and fine-tuning are either insufficient for complex style adaptation or computationally expensive, particularly in tasks like NPC creation or character role-playing. Our approach leverages the over-parameterized nature of LLMs to disentangle a style-relevant subspace within the model's representation space to conduct representation editing, ensuring a minimal impact on the original semantics. By applying adaptive editing strengths, we dynamically adjust the steering vectors in the style subspace to maintain both stylistic fidelity and semantic integrity. We develop two stylized QA benchmark datasets to validate the effectiveness of DRESS, and the results demonstrate significant improvements compared to baseline methods such as prompting and ITI. In short, DRESS is a lightweight, train-free solution for enhancing LLMs with flexible and effective style control, making it particularly useful for developing stylized conversational agents. Codes and benchmark datasets are available at https://github.com/ArthurLeoM/DRESS-LLM.
Chinese: DRESS是一种轻量级、无需训练的新方法,通过在大语言模型的表示空间中编辑风格子空间,在保持语义完整性的同时实现灵活有效的风格控制,经新基准数据集验证优于现有基线方法。
English: DRESS is a lightweight, train-free method that enhances large language models by editing representations in a style subspace, achieving superior style control without compromising semantics, as validated on new benchmark datasets.
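To make the mechanism concrete, here is a minimal sketch of editing a hidden state inside a style subspace, assuming an orthonormal basis for that subspace has already been identified; the adaptive-strength rule is a simple proxy of my own, not the authors' exact formulation.
```python
# Minimal style-subspace editing sketch (assumes an orthonormal basis).
import torch

def edit_style(h: torch.Tensor, basis: torch.Tensor,
               target: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """h: (d,) hidden state; basis: (k, d) orthonormal style directions;
    target: (k,) desired coordinates in the style subspace."""
    coords = basis @ h                          # current style coordinates
    # Adaptive strength: steer harder when far from the target style.
    strength = alpha * torch.tanh((target - coords).norm())
    # Move only within the style subspace, leaving the orthogonal complement
    # (the semantics) untouched.
    return h + strength * (basis.T @ (target - coords))
```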

Authors:Sadegh Mahdavi, Muchen Li, Kaiwen Liu, Christos Thrampoulidis, Leonid Sigal, Renjie Liao
Title: Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation
Abstract:
Advances in Large Language Models (LLMs) have sparked interest in their ability to solve Olympiad-level math problems. However, the training and evaluation of these models are constrained by the limited size and quality of available datasets, as creating large-scale data for such advanced problems requires extensive effort from human experts. In addition, current benchmarks are prone to contamination, leading to unreliable evaluations. In this paper, we present an automated pipeline that leverages the rich resources of the Art of Problem Solving (AoPS) forum, which predominantly features Olympiad-level problems and community-driven solutions. Using open-source LLMs, we develop a method to extract question-answer pairs from the forum, resulting in AoPS-Instruct, a dataset of more than 600,000 high-quality QA pairs. Our experiments demonstrate that fine-tuning LLMs on AoPS-Instruct improves their reasoning abilities across various benchmarks. Moreover, we build an automatic pipeline that introduces LiveAoPSBench, an evolving evaluation set with timestamps, derived from the latest forum data, providing a contamination-resistant benchmark for assessing LLM performance. Notably, we observe a significant decline in LLM performance over time, suggesting their success on older examples may stem from pre-training exposure rather than true reasoning ability. Our work presents a scalable approach to creating and maintaining large-scale, high-quality datasets for advanced math reasoning, offering valuable insights into the capabilities and limitations of LLMs in this domain. Our benchmark and code are available at https://github.com/DSL-Lab/aops
中文: 本文提出一种利用解题艺术论坛资源的自动化流程,构建了包含60多万高质量数学问答对的AoPS-Instruct数据集,实验表明该数据集能提升大语言模型的推理能力,同时通过带时间戳的动态基准揭示模型性能随时间下降的现象。
English: This paper introduces an automated pipeline using the Art of Problem Solving forum to create AoPS-Instruct, a large-scale dataset of over 600,000 high-quality math QA pairs, which enhances LLMs' reasoning abilities and provides a contamination-resistant benchmark revealing their performance decline over time.
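The contamination-resistant part of the pipeline comes down to timestamp filtering: score a model only on problems posted after its training cutoff. A small sketch follows, where the `timestamp` field is an assumption about the record layout, not the released data schema.
```python
# Timestamp filtering for contamination-resistant evaluation (sketch).
from datetime import date

def contamination_resistant_subset(problems: list[dict], cutoff: date) -> list[dict]:
    """Keep only problems posted strictly after the model's training cutoff."""
    return [p for p in problems if p["timestamp"] > cutoff]

bench = [{"id": 1, "timestamp": date(2024, 6, 1)},
         {"id": 2, "timestamp": date(2023, 1, 10)}]
print(contamination_resistant_subset(bench, date(2024, 1, 1)))  # keeps id 1
```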

Authors:Yi Zhao, Youzhi Zhang
Title: Siren: A Learning-Based Multi-Turn Attack Framework for Simulating Real-World Human Jailbreak Behaviors
Abstract:
Large language models (LLMs) are widely used in real-world applications, raising concerns about their safety and trustworthiness. While red-teaming with jailbreak prompts exposes the vulnerabilities of LLMs, current efforts focus primarily on single-turn attacks, overlooking the multi-turn strategies used by real-world adversaries. Existing multi-turn methods rely on static patterns or predefined logical chains, failing to account for the dynamic strategies during attacks. We propose Siren, a learning-based multi-turn attack framework designed to simulate real-world human jailbreak behaviors. Siren consists of three stages: (1) training set construction utilizing Turn-Level LLM feedback (Turn-MF), (2) post-training attackers with supervised fine-tuning (SFT) and direct preference optimization (DPO), and (3) interactions between the attacking and target LLMs. Experiments demonstrate that Siren achieves an attack success rate (ASR) of 90% with LLaMA-3-8B as the attacker against Gemini-1.5-Pro as the target model, and 70% with Mistral-7B against GPT-4o, significantly outperforming single-turn baselines. Moreover, Siren with a 7B-scale model achieves performance comparable to a multi-turn baseline that leverages GPT-4o as the attacker, while requiring fewer turns and employing decomposition strategies that are better semantically aligned with attack goals. We hope Siren inspires the development of stronger defenses against advanced multi-turn jailbreak attacks under realistic scenarios. Code is available at https://github.com/YiyiyiZhao/siren. Warning: This paper contains potentially harmful text.
中文:Siren框架通过基于学习的多轮攻击策略模拟真实的人类越狱行为,在对抗Gemini-1.5-Pro和GPT-4o等先进模型时实现了高达90%和70%的攻击成功率,显著优于现有方法。
English: The proposed Siren framework simulates realistic multi-turn jailbreak attacks on large language models by employing learning-based strategies, achieving high success rates against advanced models like Gemini-1.5-Pro and GPT-4o while outperforming existing methods.

Authors:Rong Ye, Yongxin Zhang, Yikai Zhang, Haoyu Kuang, Zhongyu Wei, Peng Sun
Title: Multi-agent KTO: Reinforcing Strategic Interactions of Large Language Model in Language Game
Abstract:
Achieving Artificial General Intelligence (AGI) requires AI agents that can not only make strategic decisions but also engage in flexible and meaningful communication. Inspired by Wittgenstein's language game theory in Philosophical Investigations, we propose that language agents can learn through in-context interaction rather than traditional multi-stage frameworks that separate decision-making from language expression. Using Werewolf, a social deduction game that tests language understanding, strategic interaction, and adaptability, we develop the Multi-agent Kahneman & Tversky's Optimization (MaKTO). MaKTO engages diverse models in extensive gameplay to generate unpaired desirable and unacceptable responses, then employs KTO to refine the model's decision-making process. In 9-player Werewolf games, MaKTO achieves a 61% average win rate across various models, outperforming GPT-4o and two-stage RL agents by relative improvements of 23.0% and 10.9%, respectively. Notably, MaKTO also demonstrates human-like performance, winning 60% against expert players and showing only 49% detectability in Turing-style blind tests.

Authors:Joshua Davis, Thomas Sounack, Kate Sciacca, Jessie M Brain, Brigitte N Durieux, Nicole D Agaronnik, Charlotta Lindvall
Title: MedSlice: Fine-Tuned Large Language Models for Secure Clinical Note Sectioning
Abstract:
Extracting sections from clinical notes is crucial for downstream analysis but is challenging due to variability in formatting and the labor-intensive nature of manual sectioning. While proprietary large language models (LLMs) have shown promise, privacy concerns limit their accessibility. This study develops a pipeline for automated note sectioning using open-source LLMs, focusing on three sections: History of Present Illness, Interval History, and Assessment and Plan. We fine-tuned three open-source LLMs to extract sections using a curated dataset of 487 progress notes, comparing results against proprietary models (GPT-4o, GPT-4o mini). Internal and external validity were assessed via precision, recall, and F1 score. Fine-tuned Llama 3.1 8B outperformed GPT-4o (F1=0.92). On the external validity test set, performance remained high (F1=0.85). Fine-tuned open-source LLMs can surpass proprietary models in clinical note sectioning, offering advantages in cost, performance, and accessibility.
中文: 本研究开发了一种基于微调开源大语言模型的临床笔记自动分段流程,在超越专有模型性能的同时,有效解决了隐私保护和可访问性问题。
English: This study develops a pipeline using fine-tuned open-source LLMs to automate clinical note sectioning, demonstrating superior performance over proprietary models while addressing privacy and accessibility concerns.

Authors:Po-Ting Lai, Chih-Hsuan Wei, Shubo Tian, Robert Leaman, Zhiyong Lu
Title: Enhancing Biomedical Relation Extraction with Directionality
Abstract:
Biological relation networks contain rich information for understanding the biological mechanisms behind the relationships of entities such as genes, proteins, diseases, and chemicals. The vast growth of biomedical literature poses significant challenges in updating the network knowledge. The recent Biomedical Relation Extraction Dataset (BioRED) provides valuable manual annotations, facilitating the development of machine-learning and pre-trained language model approaches for automatically identifying novel document-level (inter-sentence context) relationships. Nonetheless, its annotations lack directionality (subject/object) for the entity roles, essential for studying complex biological networks. Herein we annotate the entity roles of the relationships in the BioRED corpus and subsequently propose a novel multi-task language model with soft-prompt learning to jointly identify the relationship, novel findings, and entity roles. Our results include an enriched BioRED corpus with 10,864 directionality annotations. Moreover, our proposed method outperforms existing large language models such as the state-of-the-art GPT-4 and Llama-3 on two benchmarking tasks. Our source code and dataset are available at https://github.com/ncbi-nlp/BioREDirect.
Chinese: 本研究通过为BioRED语料库添加实体角色方向性注释,并提出了一个多任务语言模型,在识别关系和发现新知识方面超越了GPT-4和Llama-3等先进模型。
English: This study enhances the BioRED corpus by adding directional annotations for entity roles and introduces a multi-task language model that surpasses advanced models like GPT-4 and Llama-3 in identifying relationships and novel findings.

Authors:Yicheng Tao, Haotian Liu, Shanwen Wang, Hongteng Xu
Title: Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization
Abstract:
Formalized mathematics has recently garnered significant attention for its ability to assist mathematicians across various fields. Premise retrieval, as a common step in mathematical formalization, has been a challenge, particularly for inexperienced users. Existing retrieval methods that facilitate natural language queries require a certain level of mathematical expertise from users, while approaches based on formal languages (e.g., Lean) typically struggle with the scarcity of training data, hindering the training of effective and generalizable retrieval models. In this work, we introduce a novel method that leverages data extracted from Mathlib to train a lightweight and effective premise retrieval model. In particular, the proposed model embeds queries (i.e., proof state provided by Lean) and premises in a latent space, featuring a tokenizer specifically trained on formal corpora. The model is learned in a contrastive learning framework, in which a fine-grained similarity calculation method and a re-ranking module are applied to enhance the retrieval performance. Experimental results demonstrate that our model outperforms existing baselines, achieving higher accuracy while maintaining a lower computational load. We have released an open-source search engine based on our retrieval model at https://premise-search.com/. The source code and the trained model can be found at https://github.com/ruc-ai4math/Premise-Retrieval.
Chinese Summary: 本研究提出了一种新颖的形式化数学前提检索模型,利用Mathlib数据和对比学习框架,在降低计算负载的同时实现了比现有方法更高的检索准确率。
English Summary: This study introduces a novel premise retrieval model for formalized mathematics that uses data from Mathlib and a contrastive learning framework to achieve higher accuracy with lower computational costs compared to existing methods.

Authors:Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Hao Chen, Yilin Xiao, Chuang Zhou, Junnan Dong, Yi Chang, Xiao Huang
Title: A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks, yet their application to specialized domains remains challenging due to the need for deep expertise. Retrieval-Augmented Generation (RAG) has emerged as a promising solution to customize LLMs for professional fields by seamlessly integrating external knowledge bases, enabling real-time access to domain-specific expertise during inference. Despite its potential, traditional RAG systems, based on flat text retrieval, face three critical challenges: (i) complex query understanding in professional contexts, (ii) difficulties in knowledge integration across distributed sources, and (iii) system efficiency bottlenecks at scale. This survey presents a systematic analysis of Graph-based Retrieval-Augmented Generation (GraphRAG), a new paradigm that revolutionizes domain-specific LLM applications. GraphRAG addresses traditional RAG limitations through three key innovations: (i) graph-structured knowledge representation that explicitly captures entity relationships and domain hierarchies, (ii) efficient graph-based retrieval techniques that enable context-preserving knowledge retrieval with multi-hop reasoning ability, and (iii) structure-aware knowledge integration algorithms that leverage retrieved knowledge for accurate and logically coherent generation by LLMs. In this survey, we systematically analyze the technical foundations of GraphRAG and examine current implementations across various professional domains, identifying key technical challenges and promising research directions. All the related resources of GraphRAG, including research papers, open-source data, and projects, are collected for the community in https://github.com/DEEP-PolyU/Awesome-GraphRAG.
Chinese: GraphRAG通过图结构知识表示与检索技术,克服了传统RAG系统的局限,显著提升了专业领域大语言模型应用的推理能力和知识整合效果。
English: GraphRAG overcomes traditional RAG limitations by using graph-structured knowledge representation and retrieval techniques to enhance domain-specific LLM applications with improved reasoning and integration capabilities.

Authors:Zicheng Zhang, Xiangyu Zhao, Xinyu Fang, Chunyi Li, Xiaohong Liu, Xiongkuo Min, Haodong Duan, Kai Chen, Guangtao Zhai
Title: Redundancy Principles for MLLMs Benchmarks
Abstract:
With the rapid iteration of Multi-modality Large Language Models (MLLMs) and the evolving demands of the field, the number of benchmarks produced annually has surged into the hundreds. The rapid growth has inevitably led to significant redundancy among benchmarks. Therefore, it is crucial to take a step back and critically assess the current state of redundancy and propose targeted principles for constructing effective MLLM benchmarks. In this paper, we focus on redundancy from three key perspectives: 1) Redundancy of benchmark capability dimensions, 2) Redundancy in the number of test questions, and 3) Cross-benchmark redundancy within specific domains. Through a comprehensive analysis of hundreds of MLLMs' performance across more than 20 benchmarks, we aim to quantitatively measure the level of redundancy in existing MLLM evaluations, provide valuable insights to guide the future development of MLLM benchmarks, and offer strategies to refine and address redundancy issues effectively. The code is available at https://github.com/zzc-1998/Benchmark-Redundancy.
中文: 本文针对多模态大语言模型基准中日益严重的冗余问题,从能力维度、测试题量和跨基准重叠三个关键角度进行分析,旨在提出针对性原则以优化评估体系。
English: This paper addresses the growing redundancy in Multi-modality Large Language Model benchmarks by analyzing three key aspects—capability dimensions, test question volume, and cross-benchmark overlap—to propose targeted principles for more effective evaluation.

Authors:Ziyu Guo, Renrui Zhang, Chengzhuo Tong, Zhizheng Zhao, Rui Huang, Haoquan Zhang, Manyuan Zhang, Jiaming Liu, Shanghang Zhang, Peng Gao, Hongsheng Li, Pheng-Ann Heng
Title: Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
Abstract:
Chain-of-Thought (CoT) reasoning has been extensively explored in large models to tackle complex understanding tasks. However, it still remains an open question whether such strategies can be applied to verifying and reinforcing image generation scenarios. In this paper, we provide the first comprehensive investigation of the potential of CoT reasoning to enhance autoregressive image generation. We focus on three techniques: scaling test-time computation for verification, aligning model preferences with Direct Preference Optimization (DPO), and integrating these techniques for complementary effects. Our results demonstrate that these approaches can be effectively adapted and combined to significantly improve image generation performance. Furthermore, given the pivotal role of reward models in our findings, we propose the Potential Assessment Reward Model (PARM) and PARM++, specialized for autoregressive image generation. PARM adaptively assesses each generation step through a potential assessment approach, merging the strengths of existing reward models, and PARM++ further introduces a reflection mechanism to self-correct the generated unsatisfactory image, which is the first to incorporate reflection in autoregressive image generation. Using our investigated reasoning strategies, we enhance a baseline model, Show-o, to achieve superior results, with a significant +24% improvement on the GenEval benchmark, surpassing Stable Diffusion 3 by +15%. We hope our study provides unique insights and paves a new path for integrating CoT reasoning with autoregressive image generation. Code and models are released at https://github.com/ZiyuGuo99/Image-Generation-CoT
中文摘要:本研究首次将思维链推理应用于自回归图像生成,通过结合验证扩展、偏好对齐及新型奖励模型(PARM/PARM++),显著提升了生成性能,在基准测试中实现了24%的突破性改进。
English Summary: This study pioneers the application of Chain-of-Thought reasoning to autoregressive image generation, demonstrating that combining verification scaling, preference alignment, and novel reward models (PARM/PARM++) significantly enhances performance, achieving a 24% improvement on benchmarks.

Authors:Jiayi Lei, Renrui Zhang, Xiangfei Hu, Weifeng Lin, Zhen Li, Wenjian Sun, Ruoyi Du, Le Zhuo, Zhongyu Li, Xinyue Li, Shitian Zhao, Ziyu Guo, Yiting Lu, Peng Gao, Hongsheng Li
Title: IMAGINE-E: Image Generation Intelligence Evaluation of State-of-the-art Text-to-Image Models
Abstract:
With the rapid development of diffusion models, text-to-image (T2I) models have made significant progress, showcasing impressive abilities in prompt following and image generation. Recently launched models such as FLUX.1 and Ideogram2.0, along with others like Dall-E3 and Stable Diffusion 3, have demonstrated exceptional performance across various complex tasks, raising questions about whether T2I models are moving towards general-purpose applicability. Beyond traditional image generation, these models exhibit capabilities across a range of fields, including controllable generation, image editing, video, audio, 3D, and motion generation, as well as computer vision tasks like semantic segmentation and depth estimation. However, current evaluation frameworks are insufficient to comprehensively assess these models' performance across expanding domains. To thoroughly evaluate these models, we developed IMAGINE-E and tested six prominent models: FLUX.1, Ideogram2.0, Midjourney, Dall-E3, Stable Diffusion 3, and Jimeng. Our evaluation is divided into five key domains: structured output generation, realism and physical consistency, specific-domain generation, challenging scenario generation, and multi-style creation tasks. This comprehensive assessment highlights each model's strengths and limitations, particularly the outstanding performance of FLUX.1 and Ideogram2.0 in structured and specific-domain tasks, underscoring the expanding applications and potential of T2I models as foundational AI tools. This study provides valuable insights into the current state and future trajectory of T2I models as they evolve towards general-purpose usability. Evaluation scripts will be released at https://github.com/jylei16/Imagine-e.
中文:随着FLUX.1和Ideogram2.0等扩散模型的突破,文本到图像模型在多领域展现出通用化潜力,但现有评估体系尚不完善,为此开发的IMAGINE-E基准通过五大关键维度对六款主流模型进行了系统评估,揭示了其作为基础AI工具的发展前景。
English: Recent advances in diffusion-based text-to-image models like FLUX.1 and Ideogram2.0 demonstrate expanding capabilities across multiple domains, though current evaluation frameworks remain inadequate for comprehensive assessment, prompting the development of the IMAGINE-E benchmark to systematically evaluate six leading models across five key domains.

Authors:Rui Li, Xiaohan Wang, Yuhui Zhang, Orr Zohar, Zeyu Wang, Serena Yeung-Levy
Title: Temporal Preference Optimization for Long-Form Video Understanding
Abstract:
Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.

Authors:Dan Zhang, Tao Feng, Lilong Xue, Yuandong Wang, Yuxiao Dong, Jie Tang
Title: Parameter-Efficient Fine-Tuning for Foundation Models
Abstract:
This survey delves into the realm of Parameter-Efficient Fine-Tuning (PEFT) within the context of Foundation Models (FMs). PEFT, a cost-effective fine-tuning technique, minimizes parameters and computational complexity while striving for optimal downstream task performance. FMs like ChatGPT, DALL-E, and LLaVA specialize in language understanding, generative tasks, and multimodal tasks, trained on diverse datasets spanning text, images, and videos. The diversity of FMs guides various adaptation strategies for PEFT. Therefore, this survey aims to provide a comprehensive overview of PEFT techniques applied to diverse FMs and address critical gaps in understanding the techniques, trends, and applications. We start by providing a detailed account of the development of FMs and PEFT. Subsequently, we systematically review the key categories and core mechanisms of PEFT across diverse FMs to offer a comprehensive understanding of trends. We also explore the most recent applications across various FMs to demonstrate the versatility of PEFT, shedding light on the integration of systematic PEFT methods with a range of FMs. Furthermore, we identify potential research and development directions for improving PEFT methods in the future. This survey provides a valuable resource for both newcomers and experts seeking to understand and use the power of PEFT across FMs. All reviewed papers are listed at \url{https://github.com/THUDM/Awesome-Parameter-Efficient-Fine-Tuning-for-Foundation-Models}.
Chinese: 本综述全面探讨了基础模型的参数高效微调技术,系统分析了其核心机制、应用场景与发展趋势,为相关研究者提供了重要参考。
English: This survey comprehensively reviews parameter-efficient fine-tuning techniques for foundation models, analyzing their mechanisms, applications, and future research directions to serve as a valuable resource for researchers.

Authors:Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, Can Yang
Title: UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models
Abstract:
Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking extensive coverage of undergraduate-level mathematical problems or suffering from test-set contamination. To address these issues, we introduce UGMathBench, a diverse and dynamic benchmark specifically designed for evaluating undergraduate-level mathematical reasoning with LLMs. UGMathBench comprises 5,062 problems across 16 subjects and 111 topics, featuring 10 distinct answer types. Each problem includes three randomized versions, with additional versions planned for release as leading open-source LLMs become saturated in UGMathBench. Furthermore, we propose two key metrics: effective accuracy (EAcc), which measures the percentage of correctly solved problems across all three versions, and reasoning gap ($\Delta$), which assesses reasoning robustness by calculating the difference between the average accuracy across all versions and EAcc. Our extensive evaluation of 23 leading LLMs reveals that the highest EAcc achieved is 56.3\% by OpenAI-o1-mini, with large $\Delta$ values observed across different models. This highlights the need for future research aimed at developing "large reasoning models" with high EAcc and $\Delta = 0$. We anticipate that the release of UGMathBench, along with its detailed evaluation codes, will serve as a valuable resource to advance the development of LLMs in solving mathematical problems. Codes and data are available at https://github.com/YangLabHKUST/UGMathBench
中文: UGMathBench被提出作为一个全面的基准测试,旨在评估大型语言模型在本科数学推理上的能力,通过多样化问题和有效准确率、推理差距等新指标弥补现有不足,评估结果显示模型性能仍有很大提升空间。
English: UGMathBench is introduced as a comprehensive benchmark to evaluate LLMs' undergraduate-level mathematical reasoning, addressing gaps in existing benchmarks through diverse problems and new metrics like effective accuracy and reasoning gap, with evaluations revealing significant room for improvement in model performance.
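The two metrics have a direct arithmetic reading. Below is a minimal sketch, assuming results are stored as per-problem correctness flags over the three randomized versions; the data layout is an assumption, not the benchmark's evaluation code.
```python
# EAcc and the reasoning gap (Delta), computed from per-version correctness.
def effective_accuracy(results: dict[str, list[bool]]) -> float:
    """Fraction of problems solved in every randomized version."""
    return sum(all(v) for v in results.values()) / len(results)

def reasoning_gap(results: dict[str, list[bool]]) -> float:
    """Average per-version accuracy minus EAcc; 0 means fully robust."""
    avg = sum(sum(v) / len(v) for v in results.values()) / len(results)
    return avg - effective_accuracy(results)

# Toy example: "p2" is solved in only one of its three versions.
results = {"p1": [True, True, True], "p2": [True, False, False]}
print(effective_accuracy(results))  # 0.5
print(reasoning_gap(results))       # ~0.167
```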

Authors:Zhaoxuan Tan, Zinan Zeng, Qingkai Zeng, Zhenyu Wu, Zheyuan Liu, Fengran Mo, Meng Jiang
Title: Can Large Language Models Understand Preferences in Personalized Recommendation?
Abstract:
Large Language Models (LLMs) excel in various tasks, including personalized recommendations. Existing evaluation methods often focus on rating prediction, relying on regression errors between actual and predicted ratings. However, user rating bias and item quality, two influential factors behind rating scores, can obscure personal preferences in user-item pair data. To address this, we introduce PerRecBench, disassociating the evaluation from these two factors and assessing recommendation techniques on capturing the personal preferences in a grouped ranking manner. We find that the LLM-based recommendation techniques that are generally good at rating prediction fail to identify users' favored and disfavored items when the user rating bias and item quality are eliminated by grouping users. With PerRecBench and 19 LLMs, we find that while larger models generally outperform smaller ones, they still struggle with personalized recommendation. Our findings reveal the superiority of pairwise and listwise ranking approaches over pointwise ranking, PerRecBench's low correlation with traditional regression metrics, the importance of user profiles, and the role of pretraining data distributions. We further explore three supervised fine-tuning strategies, finding that merging weights from single-format training is promising but improving LLMs' understanding of user preferences remains an open research problem. Code and data are available at https://github.com/TamSiuhin/PerRecBench
中文摘要:PerRecBench通过消除用户评分偏差和物品质量影响来评估基于大语言模型的推荐技术,发现尽管较大模型表现更优,但在个性化推荐上仍有困难,且排序方法优于逐点评分法。
English Summary: PerRecBench is introduced to evaluate LLM-based recommendation techniques by isolating user rating bias and item quality, revealing that while larger models perform better, they still struggle with personalized recommendations and that ranking approaches are superior to pointwise methods.

Authors:Zhiyuan Weng, Guikun Chen, Wenguan Wang
Title: Do as We Do, Not as You Think: the Conformity of Large Language Models
Abstract:
Recent advancements in large language models (LLMs) revolutionize the field of intelligent agents, enabling collaborative multi-agent systems capable of tackling complex problems across various domains. However, the potential of conformity within these systems, analogous to phenomena like conformity bias and groupthink in human group dynamics, remains largely unexplored, raising concerns about their collective problem-solving capabilities and possible ethical implications. This paper presents a comprehensive study on conformity in LLM-driven multi-agent systems, focusing on three aspects: the existence of conformity, the factors influencing conformity, and potential mitigation strategies. In particular, we introduce BenchForm, a new conformity-oriented benchmark, featuring reasoning-intensive tasks and five distinct interaction protocols designed to probe LLMs' behavior in collaborative scenarios. Several representative LLMs are evaluated on BenchForm, using metrics such as conformity rate and independence rate to quantify conformity's impact. Our analysis delves into factors influencing conformity, including interaction time and majority size, and examines how the subject agent rationalizes its conforming behavior. Furthermore, we explore two strategies to mitigate conformity effects, i.e., developing enhanced personas and implementing a reflection mechanism. Several interesting findings regarding LLMs' conformity are derived from empirical results and case studies. We hope that these insights can pave the way for more robust and ethically-aligned collaborative AI systems. Our benchmark and code are available at BenchForm.
中文摘要:本研究通过开发BenchForm基准测试,系统探讨了大型语言模型驱动的多智能体系统中的从众现象,分析了其存在性、影响因素及通过增强角色设定和反思机制等缓解策略。
English Summary: This study investigates conformity in LLM-driven multi-agent systems, developing the BenchForm benchmark to analyze its existence, influencing factors, and mitigation strategies like enhanced personas and reflection mechanisms.
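The two headline metrics can be read off trial logs. Here is a hedged sketch under one plausible definition, where conformity means abandoning a correct private answer for an incorrect majority and independence means holding it; BenchForm's exact definitions and record fields may differ.
```python
# Conformity and independence rates under the assumed definitions above.
def conformity_rates(trials: list[dict]) -> tuple[float, float]:
    conform = independent = eligible = 0
    for t in trials:  # assumed fields: private, majority, final, truth
        if t["private"] == t["truth"] and t["majority"] != t["truth"]:
            eligible += 1
            if t["final"] == t["majority"]:
                conform += 1          # switched to the wrong majority
            elif t["final"] == t["private"]:
                independent += 1      # held the correct answer
    if eligible == 0:
        return 0.0, 0.0
    return conform / eligible, independent / eligible
```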

Authors:Joshua Park, Yongfeng Zhang
Title: AgentRec: Agent Recommendation Using Sentence Embeddings Aligned to Human Feedback
Abstract:
Multi-agent systems must decide which agent is the most appropriate for a given task. We propose a novel architecture for recommending which LLM agent out of many should perform a task given a natural language prompt by extending the Sentence-BERT (SBERT) encoder model. On test data, we are able to achieve a top-1 accuracy of 92.2% with each classification taking less than 300 milliseconds. In contrast to traditional classification methods, our architecture is computationally cheap, adaptive to new classes, interpretable, and controllable with arbitrary metrics through reinforcement learning. By encoding natural language prompts into sentence embeddings, our model captures the semantic content relevant to recommending an agent. The distance between sentence embeddings that belong to the same agent is then minimized through fine-tuning and aligned to human values through reinforcement learning from human feedback. This allows the classification of natural language prompts based on their nearest neighbors by measuring the cosine similarity between embeddings. This work is made possible through the generation of a synthetic dataset for agent recommendation, which we have open-sourced to the public along with the code for AgentRec recommendation system at https://github.com/joshprk/agentrec.
Chinese: 该研究提出了一种基于Sentence-BERT的新架构,通过自然语言提示推荐最适合任务的LLM代理,实现了92.2%的Top-1准确率,并通过强化学习提供了可解释性和适应性。
English: The study introduces a novel architecture using Sentence-BERT to recommend the most suitable LLM agent for tasks based on natural language prompts, achieving 92.2% top-1 accuracy efficiently and offering interpretability and adaptability through reinforcement learning.
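The recommendation step itself is nearest-neighbor search over sentence embeddings. A minimal sketch with the public sentence-transformers API follows; the agent profiles are illustrative, and AgentRec's fine-tuned encoder and RLHF alignment are not reproduced here.
```python
# Nearest-neighbor agent recommendation over SBERT embeddings (sketch).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in for the tuned encoder

agent_profiles = {  # illustrative reference prompts, one per agent
    "code_agent": "Write, refactor, and debug Python functions.",
    "search_agent": "Look up current facts and news on the web.",
}
agent_embs = {a: model.encode(t) for a, t in agent_profiles.items()}

def recommend(prompt: str) -> str:
    """Return the agent whose embedding is most cosine-similar to the prompt."""
    q = model.encode(prompt)
    return max(agent_embs, key=lambda a: util.cos_sim(q, agent_embs[a]).item())

print(recommend("Fix the off-by-one bug in this loop"))  # -> code_agent
```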

Authors:Yang Bai, Christan Earl Grant, Daisy Zhe Wang
Title: RAMQA: A Unified Framework for Retrieval-Augmented Multi-Modal Question Answering
Abstract:
Multi-modal retrieval-augmented Question Answering (MRAQA), integrating text and images, has gained significant attention in information retrieval (IR) and natural language processing (NLP). Traditional ranking methods rely on small encoder-based language models, which are incompatible with modern decoder-based generative large language models (LLMs) that have advanced various NLP tasks. To bridge this gap, we propose RAMQA, a unified framework combining learning-to-rank methods with generative permutation-enhanced ranking techniques. We first train a pointwise multi-modal ranker using LLaVA as the backbone. Then, we apply instruction tuning to train a LLaMA model for re-ranking the top-k documents using an innovative autoregressive multi-task learning approach. Our generative ranking model generates re-ranked document IDs and specific answers from document candidates in various permutations. Experiments on two MRAQA benchmarks, WebQA and MultiModalQA, show significant improvements over strong baselines, highlighting the effectiveness of our approach. Code and data are available at: https://github.com/TonyBY/RAMQA
中文摘要:RAMQA框架通过结合排序学习方法和生成式重排技术,弥合了传统排序模型与大型语言模型之间的差距,在多模态问答基准测试中实现了显著性能提升。
English Summary: The proposed RAMQA framework integrates learning-to-rank methods with generative permutation techniques to bridge the gap between traditional ranking models and modern LLMs, achieving significant performance improvements on multi-modal QA benchmarks.

Authors:Bohao Yang, Yingji Zhang, Dong Liu, André Freitas, Chenghua Lin
Title: Does Table Source Matter? Benchmarking and Improving Multimodal Scientific Table Understanding and Reasoning
Abstract:
Recent large language models (LLMs) have advanced table understanding capabilities but rely on converting tables into text sequences. While multimodal large language models (MLLMs) enable direct visual processing, they face limitations in handling scientific tables due to fixed input image resolutions and insufficient numerical reasoning capabilities. We present a comprehensive framework for multimodal scientific table understanding and reasoning with dynamic input image resolutions. Our framework consists of three key components: (1) MMSci-Pre, a domain-specific table structure learning dataset of 52K scientific table structure recognition samples, (2) MMSci-Ins, an instruction tuning dataset with 12K samples across three table-based tasks, and (3) MMSci-Eval, a benchmark with 3,114 testing samples specifically designed to evaluate numerical reasoning capabilities. Extensive experiments demonstrate that our domain-specific approach with 52K scientific table images achieves superior performance compared to 150K general-domain tables, highlighting the importance of data quality over quantity. Our proposed table-based MLLMs with dynamic input resolutions show significant improvements in both general table understanding and numerical reasoning capabilities, with strong generalisation to held-out datasets. Our code and data are publicly available at https://github.com/Bernard-Yang/MMSci_Table.
中文: 本研究提出了一种动态分辨率的多模态框架,通过领域专用数据集提升科学表格理解能力,实验证明以质量为导向的训练方法优于传统大规模数据训练。
English: The study introduces a multimodal framework with dynamic image resolution to enhance scientific table understanding, featuring domain-specific datasets and demonstrating superior performance through quality-focused training.

Authors:Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
Title: Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
Abstract:
Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLM to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.
Chinese: 本文提出的测试时偏好优化(TPO)框架,通过在推理过程中将奖励信号转化为文本评价来对齐大语言模型与人类偏好,无需重新训练模型。
English: This paper introduces Test-time Preference Optimization (TPO), a framework that aligns large language models with human preferences during inference by converting reward signals into textual critiques, eliminating the need for model retraining.
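The framework is essentially a search loop in which numerical rewards are verbalized into critiques. A schematic sketch is below; `llm` and `reward_model` are hypothetical callables, and the prompt template is illustrative rather than the authors' own.
```python
# Test-time preference optimization as width x depth search (schematic).
def tpo(llm, reward_model, query: str, width: int = 4, depth: int = 3) -> str:
    def refine(response: str) -> str:
        score, critique = reward_model(query, response)  # textual reward
        return llm(f"Question: {query}\nDraft: {response}\n"
                   f"Critique: {critique}\nRevise the draft accordingly.")

    candidates = [llm(query) for _ in range(width)]       # search width
    for _ in range(depth):                                # search depth
        best = max(candidates, key=lambda r: reward_model(query, r)[0])
        candidates = [refine(best) for _ in range(width)]
    return max(candidates, key=lambda r: reward_model(query, r)[0])
```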

Authors:Viktor Moskvoretskii, Maria Lysyuk, Mikhail Salnikov, Nikolay Ivanov, Sergey Pletenev, Daria Galimzianova, Nikita Krayko, Vasily Konovalov, Irina Nikishina, Alexander Panchenko
Title: Adaptive Retrieval Without Self-Knowledge? Bringing Uncertainty Back Home
Abstract:
Retrieval Augmented Generation (RAG) improves the correctness of Question Answering (QA) and addresses hallucinations in Large Language Models (LLMs), yet greatly increases computational costs. Besides, RAG is not always needed, as it may introduce irrelevant information. Recent adaptive retrieval methods integrate LLMs' intrinsic knowledge with external information by appealing to LLM self-knowledge, but they often neglect efficiency evaluations and comparisons with uncertainty estimation techniques. We bridge this gap by conducting a comprehensive analysis of 35 adaptive retrieval methods, including 8 recent approaches and 27 uncertainty estimation techniques, across 6 datasets using 10 metrics for QA performance, self-knowledge, and efficiency.  Our findings show that uncertainty estimation techniques often outperform complex pipelines in terms of efficiency and self-knowledge, while maintaining comparable QA performance.
中文摘要:对RAG系统中自适应检索方法的全面评估表明,不确定性估计技术在效率和自我认知方面常优于复杂流程,同时保持相当的问答性能。
English Summary: Adaptive retrieval methods in RAG systems are comprehensively evaluated, revealing that uncertainty estimation techniques often surpass complex pipelines in efficiency and self-knowledge while maintaining similar QA performance.
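The cheapest of the compared designs is an uncertainty gate: retrieve only when the model looks unsure. A small sketch using mean token entropy follows; the threshold, the scoring choice, and the `llm`/`retriever` callables are illustrative assumptions.
```python
# Entropy-gated adaptive retrieval (sketch; callables are hypothetical).
import math

def mean_token_entropy(token_dists: list[list[float]]) -> float:
    """Average entropy of the model's next-token distributions."""
    ent = sum(-sum(p * math.log(p) for p in d if p > 0) for d in token_dists)
    return ent / len(token_dists)

def answer(question, llm, retriever, threshold: float = 1.5):
    draft, token_dists = llm(question)            # parametric-only attempt
    if mean_token_entropy(token_dists) < threshold:
        return draft                              # confident: skip retrieval
    return llm(question, context=retriever(question))[0]  # fall back to RAG
```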

Authors:Sunbowen Lee, Junting Zhou, Chang Ao, Kaige Li, Xinrun Du, Sirui He, Haihong Wu, Tianci Liu, Jiaheng Liu, Hamid Alinejad-Rokny, Min Yang, Yitao Liang, Zhoufutu Wen, Shiwen Ni
Title: Quantification of Large Language Model Distillation
Abstract:
Model distillation is a fundamental technique in building large language models (LLMs), transferring knowledge from a teacher model to a student model. However, distillation can lead to model homogenization, reducing diversity among models and impairing their ability to robustly handle complex or novel tasks. These limitations underscore the need to systematically quantify the distillation process and its impact. In this work, we propose a framework to evaluate and quantify model distillation. Our method addresses two key aspects: (1) Identifying identity cognition contradictions to assess discrepancies in how models perceive and represent identity-related information, and (2) Analyzing multi-granularity response similarities across models to measure the extent of homogenization. Experimental results demonstrate two key insights: (1) Well-known closed-source and open-source LLMs usually exhibit high distillation degrees, except for Claude, Doubao, and Gemini. (2) Base LLMs show higher distillation degrees compared to aligned LLMs. By offering a systematic approach to improve the transparency of LLM data distillation, we call for LLMs with more independent development and more transparent technical reports to improve LLMs' robustness and safety. The code and data are available under https://github.com/Aegis1863/LLMs-Distillation-Quantification.
中文: 本研究提出一个量化大语言模型蒸馏过程的框架,通过分析身份认知矛盾和多粒度响应相似性,发现除Claude、豆包和Gemini外多数模型呈现高蒸馏度,并呼吁加强独立开发和透明度以提升模型的鲁棒性与安全性。
English: This study introduces a framework to quantify model distillation in large language models, focusing on identity cognition contradictions and multi-granularity response similarities, revealing high distillation degrees in most models except Claude, Doubao, and Gemini, and advocating for more independent development and transparent reporting to enhance model robustness and safety.

Authors:Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao
Title: T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation
Abstract:
Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released under https://github.com/adwardlee/t2i_safety.
中文: T2ISafety作为一个全面的安全基准,通过评估文本到图像模型在毒性、公平性和偏见方面的表现,涵盖12项任务和44个类别,利用7万条提示和6.8万张标注图像,揭示了12种扩散模型中存在的种族偏见和有害内容生成等风险。
English: T2ISafety is a comprehensive benchmark addressing safety gaps in text-to-image models by evaluating toxicity, fairness, and bias across 12 tasks and 44 categories, using 70K prompts and 68K annotated images to reveal risks like racial bias and toxic content generation in 12 diffusion models.

Authors:Haotian Luo, Li Shen, Haiying He, Yibo Wang, Shiwei Liu, Wei Li, Naiqiang Tan, Xiaochun Cao, Dacheng Tao
Title: O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
Abstract:
Recently, long-thought reasoning LLMs, such as OpenAI's O1, adopt extended reasoning processes similar to how humans ponder over complex problems. This reasoning paradigm significantly enhances the model's problem-solving abilities and has achieved promising results. However, the long-thought reasoning process leads to a substantial increase in inference time. A pressing challenge is reducing the inference overhead of long-thought LLMs while ensuring accuracy. In this paper, we experimentally demonstrate that long-thought reasoning models struggle to effectively allocate token budgets based on problem difficulty and reasoning redundancies. To address this, we propose Length-Harmonizing Fine-Tuning (O1-Pruner), aiming at minimizing reasoning overhead while maintaining accuracy. This effective fine-tuning method first estimates the LLM's baseline performance through pre-sampling and then uses RL-style fine-tuning to encourage the model to generate shorter reasoning processes under accuracy constraints. This allows the model to achieve efficient reasoning with lower redundancy while maintaining accuracy. Experiments on various mathematical reasoning benchmarks show that O1-Pruner not only significantly reduces inference overhead but also achieves higher accuracy, providing a novel and promising solution to this challenge. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner
中文:近期长思维推理大模型虽提升解题能力却显著增加推理耗时,为此提出的O1-Pruner微调方法通过优化推理长度,在保证精度的同时大幅降低计算开销。
English: Recent long-thought reasoning LLMs enhance problem-solving but increase inference time, prompting the development of O1-Pruner, a fine-tuning method that reduces reasoning redundancy while maintaining or improving accuracy.
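The objective can be sketched as a reward that pays for traces shorter than the pre-sampled baseline while penalizing accuracy loss; the reward shape and coefficient below are my own reading of the abstract, not the released code.
```python
# Length-harmonizing reward sketch (shape and coefficient are assumptions).
def length_harmonizing_reward(correct: bool, length: int,
                              baseline_len: float, baseline_acc: float,
                              lam: float = 2.0) -> float:
    length_term = baseline_len / max(length, 1) - 1.0  # >0 if shorter than baseline
    acc_term = float(correct) - baseline_acc           # <0 if below baseline accuracy
    return length_term + lam * acc_term
```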

Authors:Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
Title: InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Abstract:
Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer/tree/main/InternLM-XComposer-2.5-Reward
Chinese: 针对大型视觉语言模型公开多模态奖励模型稀缺的问题,我们提出了InternLM-XComposer2.5-Reward这一简单高效的多模态奖励模型,通过在多样化多模态偏好数据上训练,该模型在基准测试中表现优异,并成功应用于强化学习训练和响应选择等关键场景。
English: To address the scarcity of public multi-modal reward models for Large Vision Language Models (LVLMs), we introduce InternLM-XComposer2.5-Reward, a simple yet effective model trained on a diverse multi-modal preference corpus, which achieves top performance on benchmarks and enables key applications like RL training and response selection.
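Applications (2) and (3) are one-liners once a scoring callable exists. Below is a sketch with `reward_model` as a generic `(prompt, image, response) -> float` stand-in, not the released checkpoint's API.
```python
# Best-of-n selection and data filtering with a generic reward callable.
def best_of_n(prompt, image, candidates: list[str], reward_model) -> str:
    """Application (2): keep the highest-reward response at test time."""
    return max(candidates, key=lambda c: reward_model(prompt, image, c))

def filter_noisy(dataset: list[dict], reward_model, threshold: float) -> list[dict]:
    """Application (3): drop instruction-tuning samples the RM scores poorly."""
    return [ex for ex in dataset
            if reward_model(ex["prompt"], ex["image"], ex["response"]) >= threshold]
```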

Authors:Maosong Cao, Taolin Zhang, Mo Li, Chuyu Zhang, Yunxin Liu, Haodong Duan, Songyang Zhang, Kai Chen
Title: Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement
Abstract:
The quality of Supervised Fine-Tuning (SFT) data plays a critical role in enhancing the conversational capabilities of Large Language Models (LLMs). However, as LLMs become more advanced, the availability of high-quality human-annotated SFT data has become a significant bottleneck, necessitating a greater reliance on synthetic training data. In this work, we introduce Condor, a novel two-stage synthetic data generation framework that incorporates World Knowledge Tree and Self-Reflection Refinement to produce high-quality SFT data at scale. Our experimental results demonstrate that a base model fine-tuned on only 20K Condor-generated samples achieves superior performance compared to counterparts. The additional refinement stage in Condor further enables iterative self-improvement for LLMs at various scales (up to 72B), validating the effectiveness of our approach. Furthermore, our investigation into the scaling for synthetic data in post-training reveals substantial unexplored potential for performance improvements, opening promising avenues for future research.
中文: Condor框架通过世界知识树和自我反思优化两阶段方法生成高质量合成SFT数据,仅需2万样本微调的基础模型即可超越同类模型,同时展现出可扩展的自我改进能力,为后续研究开辟了新方向。
English: The Condor framework introduces a novel two-stage approach using World Knowledge Tree and Self-Reflection Refinement to generate high-quality synthetic SFT data, enabling LLMs fine-tuned with just 20K samples to outperform counterparts while demonstrating scalable self-improvement capabilities.

Authors:Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves
Title: CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Abstract:
The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the final disease prediction on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/.
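The two-stage prompting reduces to a short loop: query the LVLM concept by concept, then ground the final label in the predicted concepts, retrieving in-context examples at both stages. A schematic sketch with hypothetical `lvlm` and `retrieve_examples` callables:
```python
# Two-stage, training-free concept-based classification (schematic).
def cbvlm_predict(image, concepts: list[str], classes: list[str],
                  lvlm, retrieve_examples):
    concept_preds = {}
    for concept in concepts:                        # stage 1: concept presence
        shots = retrieve_examples(image, concept)   # few-shot demonstrations
        concept_preds[concept] = lvlm(
            image, f"Is '{concept}' present? Answer yes or no.", examples=shots)
    findings = ", ".join(f"{c}: {a}" for c, a in concept_preds.items())
    label = lvlm(image,                             # stage 2: grounded diagnosis
                 f"Findings: {findings}. Classify the image as one of {classes}.",
                 examples=retrieve_examples(image, "classification"))
    return concept_preds, label  # concept predictions make the label explainable
```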

Authors:Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas
Title: PAINT: Paying Attention to INformed Tokens to Mitigate Hallucination in Large Vision-Language Model
Abstract:
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities in understanding and describing visual content, achieving state-of-the-art performance across various vision-language tasks. However, these models often generate descriptions containing objects or details that are absent in the input image, a phenomenon commonly known as hallucination. Our work investigates the key reasons behind this issue by analyzing the pattern of self-attention in transformer layers. We find that hallucinations often arise from the progressive weakening of attention weight to visual tokens in the deeper layers of the LLM. Some previous works naively boost the attention of all visual tokens to mitigate this issue, resulting in suboptimal hallucination reduction. To address this, we identify two critical sets of visual tokens that facilitate the transfer of visual information from the vision encoder to the LLM. Local tokens encode grounded information about objects present in an image, while summary tokens capture the overall aggregated representation of the image. Importantly, these two sets of tokens require different levels of weight enhancement. To this end, we propose PAINT (Paying Attention to INformed Tokens), a plug-and-play framework that intervenes in the self-attention mechanism of the LLM, selectively boosting the attention weights of local and summary tokens with experimentally learned margins. Evaluation on the MSCOCO image captioning dataset demonstrates that our approach reduces hallucination rates by up to 62.3\% compared to baseline models while maintaining accuracy. Code is available at https://github.com/hasanar1f/PAINT
中文:PAINT框架通过选择性增强局部和摘要视觉标记的注意力权重,将LVLM的幻觉率降低高达62.3%,同时保持准确性。
English: The PAINT framework selectively enhances attention weights for local and summary visual tokens in LVLMs to reduce hallucinations by up to 62.3% while preserving accuracy.
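A hedged sketch of the kind of attention intervention PAINT describes: post-softmax weights for two index sets are raised by different margins, then rows are renormalized. The index sets and margin values below are placeholders; the paper learns its margins experimentally.

```python
# Illustrative PAINT-style intervention: boost attention to two sets of
# visual tokens by different margins, then renormalize each row.
import torch

def boost_attention(attn: torch.Tensor,
                    local_idx: torch.Tensor,
                    summary_idx: torch.Tensor,
                    local_margin: float = 0.1,
                    summary_margin: float = 0.2) -> torch.Tensor:
    """attn: [batch, heads, query_len, key_len] post-softmax weights."""
    boosted = attn.clone()
    boosted[..., local_idx] += local_margin      # grounded object tokens
    boosted[..., summary_idx] += summary_margin  # aggregated image tokens
    return boosted / boosted.sum(dim=-1, keepdim=True)  # keep rows normalized
```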

Authors:Stefan Lenz, Arsenij Ustjanzew, Marco Jeray, Meike Ressing, Torsten Panholzer
Title: Can open source large language models be used for tumor documentation in Germany? -- An evaluation on urological doctors' notes
Abstract:
Tumor documentation in Germany is largely done manually, requiring reading patient records and entering data into structured databases. Large language models (LLMs) could potentially enhance this process by improving efficiency and reliability. This evaluation tests eleven different open source LLMs with sizes ranging from 1-70 billion model parameters on three basic tasks of the tumor documentation process: identifying tumor diagnoses, assigning ICD-10 codes, and extracting the date of first diagnosis. For evaluating the LLMs on these tasks, a dataset of annotated text snippets based on anonymized doctors' notes from urology was prepared. Different prompting strategies were used to investigate the effect of the number of examples in few-shot prompting and to explore the capabilities of the LLMs in general. The models Llama 3.1 8B, Mistral 7B, and Mistral NeMo 12B performed comparably well in the tasks. Models with less extensive training data or having fewer than 7 billion parameters showed notably lower performance, while larger models did not display performance gains. Examples from a different medical domain than urology could also improve the outcome in few-shot prompting, which demonstrates the ability of LLMs to handle tasks needed for tumor documentation. Open source LLMs show a strong potential for automating tumor documentation. Models from 7-12 billion parameters could offer an optimal balance between performance and resource efficiency. With tailored fine-tuning and well-designed prompting, these models might become important tools for clinical documentation in the future. The code for the evaluation is available from https://github.com/stefan-m-lenz/UroLlmEval. We also release the dataset as a new valuable resource that addresses the shortage of authentic and easily accessible benchmarks in German-language medical NLP.
中文摘要:本研究评估了11种开源大语言模型在德国肿瘤文档自动化任务中的表现,发现70-120亿参数的模型(如Llama 3.1 8B和Mistral 7B)在性能与资源效率间达到最佳平衡,通过针对性提示策略展现出临床应用的巨大潜力。
English Summary: This study evaluates eleven open-source large language models for automating tumor documentation tasks in Germany, finding that models with 7-12 billion parameters like Llama 3.1 8B and Mistral 7B offer optimal performance-resource balance while demonstrating strong potential for clinical use through tailored prompting strategies.
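The few-shot prompting setup the abstract describes can be pictured with a minimal prompt builder. The two German example notes and their ICD-10 codes below are invented placeholders, not dataset content.

```python
# Sketch of few-shot prompt assembly for ICD-10 coding of doctors' notes.
# The example snippets are invented placeholders, not UroLlmEval data.
FEW_SHOT = [
    ("Histologie: Urothelkarzinom der Harnblase.", "C67.9"),
    ("Prostatakarzinom, Erstdiagnose 03/2021.", "C61"),
]

def build_prompt(note: str) -> str:
    lines = ["Assign the ICD-10 code for the tumor diagnosis in each note."]
    for text, code in FEW_SHOT:
        lines.append(f"Note: {text}\nICD-10: {code}")
    lines.append(f"Note: {note}\nICD-10:")  # the model completes the code
    return "\n\n".join(lines)
```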

Authors:Hamid Nasiri, Peter Garraghan
Title: EDoRA: Efficient Weight-Decomposed Low-Rank Adaptation via Singular Value Decomposition
Abstract:
Parameter-efficient fine-tuning methods, such as LoRA, reduce the number of trainable parameters. However, they often suffer from scalability issues and differences between their learning pattern and that of full fine-tuning. To overcome these limitations, we propose Efficient Weight-Decomposed Low-Rank Adaptation (EDoRA): a novel PEFT method that decomposes pre-trained weights into magnitude and directional components. By freezing low-rank matrices, initializing them by singular value decomposition, and introducing a small trainable matrix between them, EDoRA achieves a substantial reduction in trainable parameters while maintaining learning capacity. Experimental results on the GLUE benchmark demonstrate that EDoRA achieves competitive or superior performance compared to state-of-the-art methods, such as LoRA and DoRA, with up to 30x fewer trainable parameters. This makes EDoRA a highly efficient solution for adapting LLMs to diverse tasks under memory-constrained settings. Code is available at https://github.com/Hamid-Nasiri/EDoRA.
Chinese: 提出的EDoRA方法通过分解权重并采用低秩适应,有效减少可训练参数,在性能上与LoRA等现有方法相当甚至更优,同时参数数量减少高达30倍。
English: The proposed EDoRA method efficiently reduces trainable parameters by decomposing weights and using low-rank adaptations, achieving competitive performance with up to 30x fewer parameters than existing methods like LoRA.
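A rough PyTorch sketch of the decomposition the abstract describes: frozen low-rank factors initialized from an SVD of the pre-trained weight, a small trainable core between them, and a trainable magnitude vector in the style of DoRA. This is an assumption-laden illustration, not the official implementation.

```python
import torch
import torch.nn as nn

class EDoRALinearSketch(nn.Module):
    """Sketch of an EDoRA-style layer: frozen SVD-initialized factors A, B
    with a small trainable core R between them, plus a trainable magnitude
    vector (DoRA-style weight decomposition). Illustrative only."""
    def __init__(self, weight: torch.Tensor, rank: int = 8):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("W0", weight)                 # frozen base weight
        self.register_buffer("A", U[:, :rank] * S[:rank])  # frozen [out, r]
        self.register_buffer("B", Vh[:rank, :])            # frozen [r, in]
        self.R = nn.Parameter(torch.zeros(rank, rank))     # small trainable core
        self.mag = nn.Parameter(weight.norm(dim=1))        # trainable magnitude

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.W0 + self.A @ self.R @ self.B             # adapted direction
        W = W / W.norm(dim=1, keepdim=True) * self.mag.unsqueeze(1)
        return x @ W.t()
```

Only `R` and `mag` receive gradients, which is where the large reduction in trainable parameters would come from under these assumptions.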

Authors:Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, Maosong Sun
Title: EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents
Abstract:
Multimodal Large Language Models (MLLMs) have shown significant advancements, providing a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily utilize static images or videos, limiting assessments to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and not diverse enough, and thus do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories: navigation, object interaction, social interaction, attribute question answering, and spatial question answering to assess different capabilities of the agents. We evaluated state-of-the-art MLLMs on EmbodiedEval and found that they fall significantly short of human-level performance on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities, providing insights for their future development. We open-source all evaluation data and simulation framework at https://github.com/thunlp/EmbodiedEval.
Chinese: EmbodiedEval 是一个全面的评估基准,通过125个多样化3D场景中的328项交互任务来测试多模态大语言模型,揭示了其在具身能力方面相比人类表现存在的显著不足。
English: EmbodiedEval is a comprehensive benchmark designed to assess Multimodal Large Language Models (MLLMs) through 328 interactive tasks across 125 diverse 3D scenes, revealing their significant limitations in embodied capabilities compared to human performance.

Authors:Zhenhailong Wang, Haiyang Xu, Junyang Wang, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Heng Ji
Title: Mobile-Agent-E: Self-Evolving Mobile Assistant for Complex Tasks
Abstract:
Smartphones have become indispensable in modern life, yet navigating complex tasks on mobile devices often remains frustrating. Recent advancements in large multimodal model (LMM)-based mobile agents have demonstrated the ability to perceive and act in mobile environments. However, current approaches face significant limitations: they fall short in addressing real-world human needs, struggle with reasoning-intensive and long-horizon tasks, and lack mechanisms to learn and improve from prior experiences. To overcome these challenges, we introduce Mobile-Agent-E, a hierarchical multi-agent framework capable of self-evolution through past experience. By hierarchical, we mean an explicit separation of high-level planning and low-level action execution. The framework comprises a Manager, responsible for devising overall plans by breaking down complex tasks into subgoals, and four subordinate agents--Perceptor, Operator, Action Reflector, and Notetaker--which handle fine-grained visual perception, immediate action execution, error verification, and information aggregation, respectively. Mobile-Agent-E also features a novel self-evolution module which maintains a persistent long-term memory comprising Tips and Shortcuts. Tips are general guidance and lessons learned from prior tasks on how to effectively interact with the environment. Shortcuts are reusable, executable sequences of atomic operations tailored for specific subroutines. The inclusion of Tips and Shortcuts facilitates continuous refinement in performance and efficiency. Alongside this framework, we introduce Mobile-Eval-E, a new benchmark featuring complex mobile tasks requiring long-horizon, multi-app interactions. Empirical results show that Mobile-Agent-E achieves a 22% absolute improvement over previous state-of-the-art approaches across three foundation model backbones. Project page: https://x-plug.github.io/MobileAgent.
中文: Mobile-Agent-E是一种分层多智能体框架,通过长期记忆和专门下属代理实现自我进化,解决了现有移动代理的不足,性能比先前最优方法提升了22%。
English: Mobile-Agent-E is a hierarchical multi-agent framework that addresses limitations in current mobile agents by enabling self-evolution through long-term memory and specialized subordinate agents, achieving a 22% performance improvement over prior methods.
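The Tips and Shortcuts memory lends itself to a simple data-structure sketch. The field names below are guesses for illustration; the framework's actual schema may differ.

```python
from dataclasses import dataclass, field

@dataclass
class Shortcut:
    """A reusable, executable sequence of atomic operations for one subroutine."""
    name: str
    precondition: str  # when the shortcut is applicable (assumed field)
    operations: list[str] = field(default_factory=list)  # e.g. ["tap:search", "type:query"]

@dataclass
class LongTermMemory:
    """Persistent memory updated after each task (field names are guesses)."""
    tips: list[str] = field(default_factory=list)  # general lessons learned
    shortcuts: dict[str, Shortcut] = field(default_factory=dict)

    def add_tip(self, lesson: str) -> None:
        if lesson not in self.tips:
            self.tips.append(lesson)

    def add_shortcut(self, sc: Shortcut) -> None:
        self.shortcuts[sc.name] = sc
```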

Authors:Saeid Asgari Taghanaki, Joao Monteiro
Title: Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy
Abstract:
Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline where models: (i) given a topic generate an excerpt with information about the topic, (ii) given an excerpt generate question-answer pairs, and finally (iii) given a question generate an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, the accuracy on generated questions resulting from running the EQT pipeline correlates strongly with the model performance as verified by typical benchmarks such as MMLU-Pro. In other words, EQT's performance is predictive of MMLU-Pro's, and EQT can be used to rank models without the need for any external source of evaluation data other than lists of topics of interest. Moreover, our results reveal a disparity between the models' ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs. We release the code at https://github.com/asgsaeid/EQT.
中文: 大型语言模型能生成详细解释但缺乏真正理解,通过解释-查询-测试方法揭示其解释能力与推理能力之间存在差距。
English: Large language models can generate detailed explanations but lack true comprehension, as shown by the Explain-Query-Test method, which reveals a gap between their explanatory and reasoning abilities.
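The three-step EQT pipeline is easy to outline. In this sketch, `llm` is a hypothetical model call, and the answer check is a naive substring match standing in for whatever grading the authors actually use.

```python
# Sketch of the Explain-Query-Test loop with a stubbed LLM call.
def llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical stand-in; replace with a real client

def eqt_accuracy(topic: str, n_questions: int = 5) -> float:
    excerpt = llm(f"Write a detailed excerpt about: {topic}")          # (i) explain
    qa_text = llm(f"From this excerpt, write {n_questions} question-answer "
                  f"pairs as 'Q: ... A: ...':\n{excerpt}")             # (ii) query
    pairs = [p for p in qa_text.split("Q:") if "A:" in p]
    correct = 0
    for pair in pairs:
        question, gold = pair.split("A:", 1)
        answer = llm(f"Answer concisely: {question.strip()}")          # (iii) test
        correct += int(gold.strip().lower() in answer.strip().lower()) # naive grading
    return correct / max(len(pairs), 1)
```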

Authors:Haoran Sun, Yekun Chai, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Title: Curiosity-Driven Reinforcement Learning from Human Feedback
Abstract:
Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We make our code publicly available at https://github.com/ernie-research/CD-RLHF.
中文: 提出的CD-RLHF框架通过将好奇心驱动的内在奖励与传统RLHF相结合,在保持与人类偏好对齐的同时,显著提升了语言模型输出多样性,并在多项任务中验证了其有效性。
English: The proposed CD-RLHF framework enhances output diversity in large language models by integrating curiosity-driven intrinsic rewards with traditional RLHF, achieving improved diversity while maintaining alignment with human preferences across various tasks.
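A minimal sketch of mixing extrinsic and intrinsic rewards as the abstract describes. The embedding-distance novelty bonus and the `beta` coefficient below are illustrative assumptions, not the paper's exact curiosity signal.

```python
# Sketch of combining an extrinsic (reward-model) score with an intrinsic
# novelty bonus; the novelty heuristic here is a stand-in, not CD-RLHF's.
import numpy as np

class NoveltyBonus:
    def __init__(self):
        self.seen: list[np.ndarray] = []

    def __call__(self, state_emb: np.ndarray) -> float:
        if not self.seen:
            self.seen.append(state_emb)
            return 1.0
        dists = [np.linalg.norm(state_emb - s) for s in self.seen]
        self.seen.append(state_emb)
        return float(min(dists))  # far from anything seen => more novel

def total_reward(extrinsic: float, state_emb: np.ndarray,
                 bonus: NoveltyBonus, beta: float = 0.05) -> float:
    return extrinsic + beta * bonus(state_emb)
```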

Authors:Sahar Tahmasebi, David Ernst, Eric Müller-Budack, Ralph Ewerth
Title: Verifying Cross-modal Entity Consistency in News using Vision-language Models
Abstract:
The web has become a crucial source of information, but it is also used to spread disinformation, often conveyed through multiple modalities like images and text. The identification of inconsistent cross-modal information, in particular entities such as persons, locations, and events, is critical to detect disinformation. Previous works either identify out-of-context disinformation by assessing the consistency of images to the whole document, neglecting relations of individual entities, or focus on generic entities that are not relevant to news. So far, only a few approaches have addressed the task of validating entity consistency between images and text in news. However, the potential of large vision-language models (LVLMs) has not been explored yet. In this paper, we propose an LVLM-based framework for verifying Cross-modal Entity Consistency (LVLM4CEC), to assess whether persons, locations and events in news articles are consistent across both modalities. We suggest effective prompting strategies for LVLMs for entity verification that leverage reference images crawled from the web. Moreover, we extend three existing datasets for the task of entity verification in news, providing manual ground-truth data. Our results show the potential of LVLMs for automating cross-modal entity verification, with improved accuracy in identifying persons and events when using evidence images. Moreover, our method outperforms a baseline for location and event verification in documents. The datasets and source code are available on GitHub at https://github.com/TIBHannover/LVLM4CEC.
中文: 本文提出LVLM4CEC框架,利用大型视觉语言模型验证新闻中人物、地点和事件在图文模态间的一致性,通过网页抓取的参考图像提升了验证准确性,并在多个实体类型上优于基线方法。
English: This paper introduces LVLM4CEC, a framework using large vision-language models to verify the consistency of entities like persons, locations, and events across images and text in news, demonstrating improved accuracy with web-crawled reference images and outperforming baselines.

Authors:Daisuke Kikuta, Hiroki Ikeuchi, Kengo Tajiri
Title: ChaosEater: Fully Automating Chaos Engineering with Large Language Models
Abstract:
Chaos Engineering (CE) is an engineering technique aimed at improving the resiliency of distributed systems. It involves artificially injecting specific failures into a distributed system and observing its behavior in response. Based on the observation, the system can be proactively improved to handle those failures. Recent CE tools implement the automated execution of predefined CE experiments. However, defining these experiments and improving the system based on the experimental results still remain manual. To reduce the costs of these manual operations, we propose ChaosEater, a system for automating the entire CE cycle with Large Language Models (LLMs). It predefines an agentic workflow according to a systematic CE cycle and assigns subdivided operations within the workflow to LLMs. ChaosEater targets CE for Kubernetes systems, which are managed through code (i.e., Infrastructure as Code). Therefore, the LLMs in ChaosEater perform software engineering tasks to complete CE cycles, including requirement definition, code generation, debugging, and testing. We evaluate ChaosEater through case studies on both small and large Kubernetes systems. The results demonstrate that it reliably completes reasonable single CE cycles at low time and monetary cost. The CE cycles are also qualitatively validated by human engineers and LLMs.

Authors:Elad Levi, Ilan Kadar
Title: IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems
Abstract:
Large Language Models (LLMs) are transforming artificial intelligence, evolving into task-oriented systems capable of autonomous planning and execution. One of the primary applications of LLMs is conversational AI systems, which must navigate multi-turn dialogues, integrate domain-specific APIs, and adhere to strict policy constraints. However, evaluating these agents remains a significant challenge, as traditional methods fail to capture the complexity and variability of real-world interactions. We introduce IntellAgent, a scalable, open-source multi-agent framework designed to evaluate conversational AI systems comprehensively. IntellAgent automates the creation of diverse, synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. This innovative approach provides fine-grained diagnostics, addressing the limitations of static and manually curated benchmarks with coarse-grained metrics. IntellAgent represents a paradigm shift in evaluating conversational AI. By simulating realistic, multi-policy scenarios across varying levels of complexity, IntellAgent captures the nuanced interplay of agent capabilities and policy constraints. Unlike traditional methods, it employs a graph-based policy model to represent relationships, likelihoods, and complexities of policy interactions, enabling highly detailed diagnostics. IntellAgent also identifies critical performance gaps, offering actionable insights for targeted optimization. Its modular, open-source design supports seamless integration of new domains, policies, and APIs, fostering reproducibility and community collaboration. Our findings demonstrate that IntellAgent serves as an effective framework for advancing conversational AI by addressing challenges in bridging research and deployment. The framework is available at https://github.com/plurai-ai/intellagent
中文摘要:IntellAgent是一个开源的多智能体框架,通过模拟真实的多策略场景并生成精细化诊断,自动创建合成基准来全面评估对话式人工智能系统。
English Summary: IntellAgent is an open-source multi-agent framework that automates the creation of synthetic benchmarks to comprehensively evaluate conversational AI systems by simulating realistic multi-policy scenarios and providing fine-grained diagnostics.

Authors:Sani Abdullahi Sani, Shamsuddeen Hassan Muhammad, Devon Jarvis
Title: Investigating the Impact of Language-Adaptive Fine-Tuning on Sentiment Analysis in Hausa Language Using AfriBERTa
Abstract:
Sentiment analysis (SA) plays a vital role in Natural Language Processing (NLP) by identifying sentiments expressed in text. Although significant advances have been made in SA for widely spoken languages, low-resource languages such as Hausa face unique challenges, primarily due to a lack of digital resources. This study investigates the effectiveness of Language-Adaptive Fine-Tuning (LAFT) to improve SA performance in Hausa. We first curate a diverse, unlabeled corpus to expand the model's linguistic capabilities, followed by applying LAFT to adapt AfriBERTa specifically to the nuances of the Hausa language. The adapted model is then fine-tuned on the labeled NaijaSenti sentiment dataset to evaluate its performance. Our findings demonstrate that LAFT gives modest improvements, which may be attributed to the use of formal Hausa text rather than informal social media data. Nevertheless, the pre-trained AfriBERTa model significantly outperformed models not specifically trained on Hausa, highlighting the importance of using pre-trained models in low-resource contexts. This research emphasizes the necessity for diverse data sources to advance NLP applications for low-resource African languages. We published the code and the dataset to encourage further research and facilitate reproducibility in low-resource NLP here: https://github.com/Sani-Abdullahi-Sani/Natural-Language-Processing/blob/main/Sentiment%20Analysis%20for%20Low%20Resource%20African%20Languages
中文: 本研究采用语言自适应微调技术提升豪萨语情感分析效果,在正式文本中取得适度改进,同时证明AfriBERTa模型在低资源语言环境中的优越性,并强调非洲语言自然语言处理需要多样化数据支持。
English: This study applies Language-Adaptive Fine-Tuning to enhance sentiment analysis for Hausa, showing modest gains with formal texts while demonstrating AfriBERTa's superiority in low-resource settings and emphasizing the need for diverse data in African language NLP.

Authors:Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, Akim Tsvigun, Vladislav Mikhailov, Rui Xing, Zhuohan Xie, Jiahui Geng, Giovanni Puccetti, Ekaterina Artemova, Jinyan Su, Minh Ngoc Ta, Mervat Abassy, Kareem Ashraf Elozeiri, Saad El Dine Ahmed El Etter, Maiya Goloburda, Tarek Mahmoud, Raj Vardhan Tomar, Nurkhan Laiyk, Osama Mohammed Afzal, Ryuto Koike, Masahiro Kaneko, Alham Fikri Aji, Nizar Habash, Iryna Gurevych, Preslav Nakov
Title: GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Abstract:
We present the GenAI Content Detection Task 1 -- a shared task on binary machine-generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 26 teams to the Multilingual subtask. We provide a comprehensive overview of the data, a summary of the results -- including system rankings and performance scores -- detailed descriptions of the participating systems, and an in-depth analysis of submissions. Materials are available at https://github.com/mbzuai-nlp/COLING-2025-Workshop-on-MGT-Detection-Task1.
中文: COLING 2025的GenAI内容检测任务1聚焦于二元分类的机器生成文本检测,包含单语和多语言子任务,分别吸引了36支和26支团队参与,并提供了数据概览、结果总结、系统描述及提交内容的深入分析。
English: The GenAI Content Detection Task 1 at COLING 2025 involved binary machine-generated text detection with Monolingual and Multilingual subtasks, attracting 36 and 26 teams respectively, and included data overviews, results, system descriptions, and submission analyses.

Authors:Zhanpeng Chen, Mingxiao Li, Ziyang Chen, Nan Du, Xiaolong Li, Yuexian Zou
Title: Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
Abstract:
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence, yet the irrational encoding of visual positions persists in inhibiting the models' comprehensive perception performance across different levels of granularity. In this work, we propose Pyramid-descent Visual Position Encoding (PyPE), a novel approach designed to enhance the perception of visual tokens within VLMs. By assigning visual position indexes from the periphery to the center and expanding the central receptive field incrementally, PyPE addresses the limitations of traditional raster-scan methods and mitigates the long-term decay effects induced by Rotary Position Embedding (RoPE). Our method reduces the relative distance between interrelated visual elements and instruction tokens, promoting a more rational allocation of attention weights, allowing for multi-granularity perception of visual elements, and countering the over-reliance on anchor tokens. Extensive experimental evaluations demonstrate that PyPE consistently improves the general capabilities of VLMs across various sizes. Code is available at https://github.com/SakuraTroyChen/PyPE.
中文: 提出的金字塔式视觉位置编码(PyPE)通过从外围到中心的视觉位置索引和逐步扩展中心感受野,增强了视觉语言模型的多粒度感知能力,有效提升了不同规模模型的综合性能。
English: The proposed Pyramid-descent Visual Position Encoding (PyPE) enhances vision-language models by rationally encoding visual positions to improve multi-granularity perception and attention allocation, consistently boosting model performance across various scales.
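The periphery-to-center indexing can be illustrated on an H x W grid of visual tokens: tokens are ranked by their ring distance from the border, outermost first. The exact assignment rule in the paper may differ; this is a sketch of the idea.

```python
# Illustrative periphery-to-center index assignment for an H x W grid of
# visual tokens: outermost-ring tokens get the smallest position indexes.
def pyramid_descent_indexes(h: int, w: int) -> list[list[int]]:
    # Ring distance: 0 on the border, growing toward the center.
    ring = [[min(i, j, h - 1 - i, w - 1 - j) for j in range(w)] for i in range(h)]
    order = sorted((ring[i][j], i, j) for i in range(h) for j in range(w))
    idx = [[0] * w for _ in range(h)]
    for pos, (_, i, j) in enumerate(order):
        idx[i][j] = pos
    return idx

# Example: on a 4x4 grid the 12 border tokens receive indexes 0..11 and the
# four central tokens 12..15, the reverse of a raster scan's bias.
```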

Authors:Jing Ding, Kai Feng, Binbin Lin, Jiarui Cai, Qiushi Wang, Yu Xie, Xiaojin Zhang, Zhongyu Wei, Wei Chen
Title: InsQABench: Benchmarking Chinese Insurance Domain Question Answering with Large Language Models
Abstract:
The application of large language models (LLMs) has achieved remarkable success in various fields, but their effectiveness in specialized domains like the Chinese insurance industry remains underexplored. The complexity of insurance knowledge, encompassing specialized terminology and diverse data types, poses significant challenges for both models and users. To address this, we introduce InsQABench, a benchmark dataset for the Chinese insurance sector, structured into three categories: Insurance Commonsense Knowledge, Insurance Structured Database, and Insurance Unstructured Documents, reflecting real-world insurance question-answering tasks. We also propose two methods, SQL-ReAct and RAG-ReAct, to tackle challenges in structured and unstructured data tasks. Evaluations show that while LLMs struggle with domain-specific terminology and nuanced clause texts, fine-tuning on InsQABench significantly improves performance. Our benchmark establishes a solid foundation for advancing LLM applications in the insurance domain, with data and code available at https://github.com/HaileyFamo/InsQABench.git.
Chinese Summary: 本研究针对中文保险领域推出InsQABench评估基准,通过实验证明虽然大语言模型在专业术语理解上存在困难,但基于该数据集的微调能显著提升模型在保险问答任务中的表现。
English Summary: This study introduces InsQABench, a specialized benchmark for evaluating large language models in the Chinese insurance industry, and demonstrates that fine-tuning with this dataset significantly enhances model performance despite initial challenges with domain-specific terminology.

Authors:Saibo Geng, Hudson Cooper, Michał Moskal, Samuel Jenkins, Julian Berman, Nathan Ranchin, Robert West, Eric Horvitz, Harsha Nori
Title: JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models
Abstract:
Reliably generating structured outputs has become a critical capability for modern language model (LM) applications. Constrained decoding has emerged as the dominant technology across sectors for enforcing structured outputs during generation. Despite its growing adoption, little systematic evaluation of the behavior and performance of constrained decoding has been done. Constrained decoding frameworks have standardized around JSON Schema as a structured data format, with most uses guaranteeing constraint compliance given a schema. However, the effectiveness of these methods in practice remains poorly understood. We present an evaluation framework to assess constrained decoding approaches across three critical dimensions: efficiency in generating constraint-compliant outputs, coverage of diverse constraint types, and quality of the generated outputs. To facilitate this evaluation, we introduce JSONSchemaBench, a benchmark for constrained decoding comprising 10K real-world JSON schemas that encompass a wide range of constraints with varying complexity. We pair the benchmark with the existing official JSON Schema Test Suite and evaluate six state-of-the-art constrained decoding frameworks, including Guidance, Outlines, Llamacpp, XGrammar, OpenAI, and Gemini. Through extensive experiments, we gain insights into the capabilities and limitations of constrained decoding on structured generation with real-world JSON schemas. Our work provides actionable insights for improving constrained decoding frameworks and structured generation tasks, setting a new standard for evaluating constrained decoding and structured generation. We release JSONSchemaBench at https://github.com/guidance-ai/jsonschemabench.
Chinese: 本文提出了一个评估框架和JSONSchemaBench基准,通过测试六种先进约束解码框架,系统评估了语言模型结构化输出的约束解码方法,揭示了其在效率、约束覆盖和输出质量方面的重要特性。
English: This paper introduces a comprehensive evaluation framework and JSONSchemaBench benchmark to systematically assess constrained decoding methods for structured language model outputs, revealing key insights about their efficiency, constraint coverage, and output quality through testing six state-of-the-art frameworks.
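Checking whether a generated output complies with a JSON schema, as a harness like this benchmark would, can be done with the `jsonschema` package (`pip install jsonschema`). The schema below is a toy example, not one of the benchmark's 10K real-world schemas.

```python
# Compliance check of a model's raw text output against a JSON schema.
import json
from jsonschema import Draft202012Validator

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"},
                   "age": {"type": "integer", "minimum": 0}},
    "required": ["name", "age"],
}

def is_compliant(model_output: str) -> bool:
    try:
        instance = json.loads(model_output)  # must first be valid JSON at all
    except json.JSONDecodeError:
        return False
    return Draft202012Validator(schema).is_valid(instance)

print(is_compliant('{"name": "Ada", "age": 36}'))  # True
print(is_compliant('{"name": "Ada", "age": -1}'))  # False (minimum violated)
```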

Authors:Ruixuan Zhang, Beichen Wang, Juexiao Zhang, Zilin Bian, Chen Feng, Kaan Ozbay
Title: When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
Abstract:
The increasing availability of traffic videos functioning on a 24/7/365 time scale has great potential to increase the spatio-temporal coverage of traffic accidents, which will help improve traffic safety. However, analyzing footage from hundreds, if not thousands, of traffic cameras in a 24/7/365 working protocol remains an extremely challenging task, as current vision-based approaches primarily focus on extracting raw information, such as vehicle trajectories or individual object detection, but require laborious post-processing to derive actionable insights. We propose SeeUnsafe, a new framework that integrates Multimodal Large Language Model (MLLM) agents to transform video-based traffic accident analysis from a traditional extraction-then-explanation workflow to a more interactive, conversational approach. This shift significantly enhances processing throughput by automating complex tasks like video classification and visual grounding, while improving adaptability by enabling seamless adjustments to diverse traffic scenarios and user-defined queries. Our framework employs a severity-based aggregation strategy to handle videos of various lengths and a novel multimodal prompt to generate structured responses for review and evaluation and enable fine-grained visual grounding. We introduce IMS (Information Matching Score), a new MLLM-based metric for aligning structured responses with ground truth. We conduct extensive experiments on the Toyota Woven Traffic Safety dataset, demonstrating that SeeUnsafe effectively performs accident-aware video classification and visual grounding by leveraging off-the-shelf MLLMs. Source code will be available at https://github.com/ai4ce/SeeUnsafe.
中文: SeeUnsafe框架通过多模态大语言模型代理,将传统交通事故视频分析转变为交互式对话处理,利用基于严重程度的聚合策略自动完成分类和视觉定位任务,显著提升了处理效率与场景适应性。
English: The SeeUnsafe framework utilizes Multimodal Large Language Model agents to revolutionize traffic accident analysis by enabling interactive, conversational processing of surveillance videos, automating classification and visual grounding tasks while adapting to diverse scenarios through a severity-based aggregation strategy.

Authors:Kazuma Onishi, Katsuhiko Hayashi
Title: A Simple but Effective Closed-form Solution for Extreme Multi-label Learning
Abstract:
Extreme multi-label learning (XML) is a task of assigning multiple labels from an extremely large set of labels to each data instance. Many current high-performance XML models are composed of many hyperparameters, which complicates the tuning process. Additionally, the models themselves are adapted specifically to XML, which complicates their reimplementation. To remedy this problem, we propose a simple method based on ridge regression for XML. The proposed method not only has a closed-form solution but also is composed of a single hyperparameter. Since there is no precedent for applying ridge regression to XML, we verify the method's performance on various XML benchmark datasets. Furthermore, we enhanced the prediction of low-frequency labels in XML, which carry informative content. This prediction is essential yet challenging because of the limited amount of data. Here, we employed a simple frequency-based weighting. This approach greatly simplifies the process compared with existing techniques. Experimental results revealed that it can achieve levels of performance comparable to, or even exceeding, those of models with numerous hyperparameters. Additionally, we found that the frequency-based weighting significantly improved the predictive performance for low-frequency labels, while requiring almost no changes in implementation. The source code for the proposed method is available on GitHub at https://github.com/cars1015/XML-ridge.
中文: 本文提出了一种基于岭回归的极大多标签学习方法,仅使用单一超参数,并通过频率加权提升低频标签的预测效果,其性能达到甚至超越了复杂模型。
English: This paper introduces a simple ridge regression-based method for extreme multi-label learning that uses only one hyperparameter and employs frequency-based weighting to enhance predictions for low-frequency labels, achieving performance comparable to or better than complex models.
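The closed-form solution is standard ridge regression applied to a (possibly weighted) label matrix. Below is a minimal NumPy sketch; the inverse-frequency power weighting is an illustrative guess at the paper's frequency-based scheme, not its exact rule.

```python
# Closed-form ridge solution for XML: W = (X^T X + lam*I)^{-1} X^T Y,
# with an (assumed) inverse-frequency weighting to up-weight rare labels.
import numpy as np

def fit_ridge_xml(X: np.ndarray, Y: np.ndarray,
                  lam: float = 1.0, alpha: float = 0.5) -> np.ndarray:
    """X: [n, d] features; Y: [n, L] binary label matrix. Returns W: [d, L]."""
    freq = Y.sum(axis=0).clip(min=1)        # per-label frequency
    Yw = Y * (1.0 / freq) ** alpha          # boost low-frequency labels
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Yw)

# Prediction: rank the labels of each test instance by the scores X_test @ W.
```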

Authors:Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang
Title: ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
Abstract:
Enhancing large language models (LLMs) with real-time APIs can help generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored due to the complexity of data collection and evaluation. In this work, we introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench encompasses multi-step and constrained function calling, which requires long-parameter filling, parameter value reasoning, and 128k long context. Additionally, we propose an automatic framework, ComplexEval, for quantitatively evaluating complex function calling tasks. Through comprehensive experiments, we demonstrate the deficiencies of state-of-the-art LLMs in function calling and suggest future directions for optimizing these capabilities. The data and code are available at https://github.com/THUDM/ComplexFuncBench.
中文摘要:本研究提出了ComplexFuncBench基准,用于评估大语言模型在现实场景中的复杂函数调用能力,并开发了自动评估框架ComplexEval,揭示了当前模型的不足并为未来优化指明了方向。
English Summary: This study introduces ComplexFuncBench, a benchmark for evaluating complex function calling in large language models across real-world scenarios, and proposes an automatic evaluation framework called ComplexEval to identify deficiencies in current models and guide future improvements.

Authors:Zilyu Ji, Yuntian Shen, Jionghao Lin, Kenneth R. Koedinger
Title: Enhancing the De-identification of Personally Identifiable Information in Educational Data
Abstract:
Protecting Personally Identifiable Information (PII), such as names, is a critical requirement in learning technologies to safeguard student and teacher privacy and maintain trust. Accurate PII detection is an essential step toward anonymizing sensitive information while preserving the utility of educational data. Motivated by recent advancements in artificial intelligence, our study investigates the GPT-4o-mini model as a cost-effective and efficient solution for PII detection tasks. We explore both prompting and fine-tuning approaches and compare GPT-4o-mini's performance against established frameworks, including Microsoft Presidio and Azure AI Language. Our evaluation on two public datasets, CRAPII and TSCC, demonstrates that the fine-tuned GPT-4o-mini model achieves superior performance, with a recall of 0.9589 on CRAPII. Additionally, fine-tuned GPT-4o-mini significantly improves precision scores (a threefold increase) while reducing computational costs to nearly one-tenth of those associated with Azure AI Language. Furthermore, our bias analysis reveals that the fine-tuned GPT-4o-mini model consistently delivers accurate results across diverse cultural backgrounds and genders. The generalizability analysis using the TSCC dataset further highlights its robustness, achieving a recall of 0.9895 with minimal additional training data from TSCC. These results emphasize the potential of fine-tuned GPT-4o-mini as an accurate and cost-effective tool for PII detection in educational data. It offers robust privacy protection while preserving the data's utility for research and pedagogical analysis. Our code is available on GitHub: https://github.com/AnonJD/PrivacyAI
中文: 本研究表明,经过微调的GPT-4o-mini模型在个人身份信息检测任务中优于现有框架,不仅显著提升了召回率与精确度,还将计算成本降至十分之一,同时在不同文化背景和性别群体中保持稳定性能。
English: This study demonstrates that fine-tuned GPT-4o-mini outperforms existing frameworks in PII detection, achieving superior recall and precision while significantly reducing computational costs and maintaining accuracy across diverse demographics.

Authors:Zekun Xi, Wenbiao Yin, Jizhan Fang, Jialong Wu, Runnan Fang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang
Title: OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Abstract:
Machine writing with large language models often relies on retrieval-augmented generation. However, these approaches remain confined within the boundaries of the model's predefined scope, limiting the generation of content with rich information. Specifically, vanilla-retrieved information tends to lack depth, novelty, and suffers from redundancy, which negatively impacts the quality of generated articles, leading to shallow, unoriginal, and repetitive outputs. To address these issues, we propose OmniThink, a slow-thinking machine writing framework that emulates the human-like process of iterative expansion and reflection. The core idea behind OmniThink is to simulate the cognitive behavior of learners as they slowly deepen their knowledge of the topics. Experimental results demonstrate that OmniThink improves the knowledge density of generated articles without compromising metrics such as coherence and depth. Human evaluations and expert feedback further highlight the potential of OmniThink to address real-world challenges in the generation of long-form articles. Code is available at https://github.com/zjunlp/OmniThink.
中文摘要:OmniThink是一种慢思考机器写作框架,通过模拟人类迭代学习过程来克服检索增强生成的局限性,能在保持连贯性和深度的同时提高生成文章的知识密度。
English Summary: OmniThink is a slow-thinking machine writing framework designed to overcome the limitations of retrieval-augmented generation by mimicking human iterative learning, resulting in articles with higher knowledge density while maintaining coherence and depth.

Authors:Zhaocheng Liu, Quan Tu, Wen Ye, Yu Xiao, Zhishou Zhang, Hengfu Cui, Yalun Zhu, Qiang Ju, Shizheng Li, Jian Xie
Title: Exploring the Inquiry-Diagnosis Relationship with Advanced Patient Simulators
Abstract:
Recently, large language models have shown great potential to transform online medical consultation. Despite this, most research targets improving diagnostic accuracy with ample information, often overlooking the inquiry phase. Some studies try to evaluate or refine doctor models by using prompt-engineered patient agents. However, prompt engineering alone falls short in accurately simulating real patients. We need to explore new paradigms for patient simulation. Furthermore, the relationship between inquiry and diagnosis remains unexplored. This paper extracts dialogue strategies from real doctor-patient conversations to guide the training of a patient simulator. Using dynamic dialogue strategies, our simulator shows higher anthropomorphism and lower hallucination rates. This innovation offers a more accurate evaluation of diagnostic models and generates realistic synthetic data. We conduct extensive experiments on the relationship between inquiry and diagnosis, showing they adhere to Liebig's law: poor inquiry limits diagnosis effectiveness, regardless of diagnostic skill, and vice versa. The experiments also reveal substantial differences in inquiry performance among models. To delve into this phenomenon, the inquiry process is categorized into four distinct types. Analyzing the distribution of inquiries across these types helps explain the performance differences. The weights of our patient simulator are available at https://github.com/PatientSimulator/PatientSimulator.
中文摘要:本文通过提取真实医患对话的策略训练患者模拟器,提升了拟人化程度并降低了幻觉率,从而更准确地评估诊断模型和生成合成数据,同时基于李比希定律揭示了问诊与诊断的相互制约关系。
English Summary: This paper introduces a patient simulator trained with dialogue strategies from real doctor-patient conversations, enhancing anthropomorphism and reducing hallucinations to better evaluate diagnostic models and generate synthetic data, while also exploring the critical relationship between inquiry and diagnosis under Liebig's law.

Authors:Fen Wang, Bomiao Wang, Xueli Shu, Zhen Liu, Zekai Shao, Chao Liu, Siming Chen
Title: ChartInsighter: An Approach for Mitigating Hallucination in Time-series Chart Summary Generation with A Benchmark Dataset
Abstract:
Effective chart summary can significantly reduce the time and effort decision makers spend interpreting charts, enabling precise and efficient communication of data insights. Previous studies have faced challenges in generating accurate and semantically rich summaries of time-series data charts. In this paper, we identify summary elements and common hallucination types in the generation of time-series chart summaries, which serve as our guidelines for automatic generation. We introduce ChartInsighter, which automatically generates chart summaries of time-series data, effectively reducing hallucinations in chart summary generation. Specifically, we assign multiple agents to generate the initial chart summary and collaborate iteratively, during which they invoke external data analysis modules to extract insights and compile them into a coherent summary. Additionally, we implement a self-consistency test method to validate and correct our summary. We create a high-quality benchmark of charts and summaries, with hallucination types annotated on a sentence-by-sentence basis, facilitating the evaluation of the effectiveness of reducing hallucinations. Our evaluations using our benchmark show that our method surpasses state-of-the-art models, and that our summary hallucination rate is the lowest, which effectively reduces various hallucinations and improves summary quality. The benchmark is available at https://github.com/wangfen01/ChartInsighter.
中文摘要:ChartInsighter通过多智能体协作与自一致性检验方法,有效减少时序图表摘要中的幻觉现象,在新建基准测试中表现优于现有最优模型。
English Summary: ChartInsighter is an automated system that reduces hallucinations in time-series chart summaries through multi-agent collaboration and self-consistency testing, achieving state-of-the-art performance on a newly created benchmark.

Authors:Eshaan Tanwar, Gayatri Oke, Tanmoy Chakraborty
Title: Multilingual LLMs Struggle to Link Orthography and Semantics in Bilingual Word Processing
Abstract:
Bilingual lexical processing is shaped by the complex interplay of phonological, orthographic, and semantic features of two languages within an integrated mental lexicon. In humans, this is evident in the ease with which cognate words - words similar in both orthographic form and meaning (e.g., blind, meaning "sightless" in both English and German) - are processed, compared to the challenges posed by interlingual homographs, which share orthographic form but differ in meaning (e.g., gift, meaning "present" in English but "poison" in German). We investigate how multilingual Large Language Models (LLMs) handle such phenomena, focusing on English-Spanish, English-French, and English-German cognates, non-cognate, and interlingual homographs. Specifically, we evaluate their ability to disambiguate meanings and make semantic judgments, both when these word types are presented in isolation or within sentence contexts. Our findings reveal that while certain LLMs demonstrate strong performance in recognizing cognates and non-cognates in isolation, they exhibit significant difficulty in disambiguating interlingual homographs, often performing below random baselines. This suggests LLMs tend to rely heavily on orthographic similarities rather than semantic understanding when interpreting interlingual homographs. Further, we find LLMs exhibit difficulty in retrieving word meanings, with performance in isolative disambiguation tasks having no correlation with semantic understanding. Finally, we study how the LLM processes interlingual homographs in incongruent sentences. We find models to opt for different strategies in understanding English and non-English homographs, highlighting a lack of a unified approach to handling cross-lingual ambiguities.
中文: 多语言大语言模型在识别同源词和非同源词方面表现良好,但在区分跨语言同形异义词时存在显著困难,往往依赖拼写相似性而非语义理解,且缺乏处理跨语言歧义的一致性策略。
English: Multilingual Large Language Models (LLMs) show proficiency in recognizing cognates and non-cognates but struggle significantly with disambiguating interlingual homographs, often relying on orthographic cues over semantic understanding and lacking a consistent strategy for cross-lingual ambiguities.

Authors:Ruixiang Jiang, Changwen Chen
Title: Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Abstract:
The rapid technical progress of generative art (GenArt) has democratized the creation of visually appealing imagery. However, achieving genuine artistic impact - the kind that resonates with viewers on a deeper, more meaningful level - remains formidable as it requires a sophisticated aesthetic sensibility. This sensibility involves a multifaceted cognitive process extending beyond mere visual appeal, which is often overlooked by current computational methods. This paper pioneers an approach to capture this complex process by investigating how the reasoning capabilities of Multimodal LLMs (MLLMs) can be effectively elicited to perform aesthetic judgment. Our analysis reveals a critical challenge: MLLMs exhibit a tendency towards hallucinations during aesthetic reasoning, characterized by subjective opinions and unsubstantiated artistic interpretations. We further demonstrate that these hallucinations can be suppressed by employing an evidence-based and objective reasoning process, as substantiated by our proposed baseline, ArtCoT. MLLMs prompted by this principle produce multifaceted, in-depth aesthetic reasoning that aligns significantly better with human judgment. These findings have direct applications in areas such as AI art tutoring and as reward models for image generation. Ultimately, we hope this work paves the way for AI systems that can truly understand, appreciate, and contribute to art that aligns with human aesthetic values. Project homepage: https://github.com/songrise/MLLM4Art.
中文: 本文提出ArtCoT方法,通过基于证据的推理过程抑制多模态大语言模型在审美判断中的幻觉,使其评估更符合人类审美标准,可应用于AI艺术辅导和图像生成领域。
English: This paper introduces ArtCoT, a method that leverages Multimodal LLMs' reasoning to perform aesthetic judgments by suppressing hallucinations through evidence-based processes, resulting in evaluations that better align with human values and enabling applications in AI art tutoring and image generation.

Authors:Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
Title: MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Abstract:
Multimodal document retrieval aims to identify and retrieve various forms of multimodal content, such as figures, tables, charts, and layout information from extensive documents. Despite its increasing popularity, there is a notable lack of a comprehensive and robust benchmark to effectively evaluate the performance of systems in such tasks. To address this gap, this work introduces a new benchmark, named MMDocIR, that encompasses two distinct tasks: page-level and layout-level retrieval. The former evaluates the performance of identifying the most relevant pages within a long document, while the latter assesses the ability of detecting specific layouts, providing a more fine-grained measure than whole-page analysis. A layout refers to a variety of elements, including textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring 1,685 questions annotated by experts and 173,843 questions with bootstrapped labels, making it a valuable resource in multimodal document retrieval for both training and evaluation. Through rigorous experiments, we demonstrate that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR training set effectively enhances the performance of multimodal document retrieval and (iii) text retrievers leveraging VLM-text significantly outperform retrievers relying on OCR-text. Our dataset is available at https://mmdocrag.github.io/MMDocIR/.
中文: 本研究提出了MMDocIR这一综合性基准,用于评估多模态文档检索,包含页面级和布局级任务,结合专家标注和自举数据集,有效提升了系统性能评估与训练效果。
English: This work introduces MMDocIR, a comprehensive benchmark for multimodal document retrieval that includes page-level and layout-level tasks, featuring expert-annotated and bootstrapped datasets to enhance system performance evaluation and training.

Authors:Irina Bigoulaeva, Harish Tayyar Madabushi, Iryna Gurevych
Title: The Inherent Limits of Pretrained LLMs: The Unexpected Convergence of Instruction Tuning and In-Context Learning Capabilities
Abstract:
Large Language Models (LLMs), trained on extensive web-scale corpora, have demonstrated remarkable abilities across diverse tasks, especially as they are scaled up. Nevertheless, even state-of-the-art models struggle in certain cases, sometimes failing at problems solvable by young children, indicating that traditional notions of task complexity are insufficient for explaining LLM capabilities. However, exploring LLM capabilities is complicated by the fact that most widely-used models are also "instruction-tuned" to respond appropriately to prompts. With the goal of disentangling the factors influencing LLM performance, we investigate whether instruction-tuned models possess fundamentally different capabilities from base models that are prompted using in-context examples. Through extensive experiments across various model families, scales and task types, which included instruction tuning 90 different LLMs, we demonstrate that the performance of instruction-tuned models is significantly correlated with the in-context performance of their base counterparts. By clarifying what instruction-tuning contributes, we extend prior research into in-context learning, which suggests that base models use priors from pretraining data to solve tasks. Specifically, we extend this understanding to instruction-tuned models, suggesting that their pretraining data similarly sets a limiting boundary on the tasks they can solve, with the added influence of the instruction-tuning dataset.
Chinese: 大型语言模型的研究表明,指令微调模型的性能与其基础模型的上下文学习能力密切相关,预训练和指令微调数据共同决定了它们的能力边界。
English: Large Language Models (LLMs) show that instruction-tuned models' performance is strongly linked to their base models' in-context learning, with both pretraining and instruction-tuning data shaping their capabilities within set limits.

Authors:Han Wang, Jianqiang Li, Qing Zhao, Zhonglong Chen, Changwei Song, Jing Tang, Yuning Huang, Wei Zhai, Yongsheng Tong, Guanghui Fu
Title: Deep Learning-Based Feature Fusion for Emotion Analysis and Suicide Risk Differentiation in Chinese Psychological Support Hotlines
Abstract:
Mental health is a critical global public health issue, and psychological support hotlines play a pivotal role in providing mental health assistance and identifying suicide risks at an early stage. However, the emotional expressions conveyed during these calls remain underexplored in current research. This study introduces a method that combines pitch acoustic features with deep learning-based features to analyze and understand emotions expressed during hotline interactions. Using data from China's largest psychological support hotline, our method achieved an F1-score of 79.13% for negative binary emotion classification. Additionally, the proposed approach was validated on an open dataset for multi-class emotion classification, where it demonstrated better performance compared to state-of-the-art methods. To explore its clinical relevance, we applied the model to analyze the frequency of negative emotions and the rate of emotional change in the conversation, comparing 46 subjects with suicidal behavior to those without. While the suicidal group exhibited more frequent emotional changes than the non-suicidal group, the difference was not statistically significant. Importantly, our findings suggest that emotional fluctuation intensity and frequency could serve as novel features for psychological assessment scales and suicide risk prediction. The proposed method provides valuable insights into emotional dynamics and has the potential to advance early intervention and improve suicide prevention strategies through integration with clinical tools and assessments. The source code is publicly available at https://github.com/Sco-field/Speechemotionrecognition/tree/main.
中文: 本研究提出了一种结合音高特征与深度学习的方法来分析心理热线通话中的情绪表达,该方法在情绪分类中表现出色,并发现情绪波动特征可作为自杀风险评估和预防策略的新指标。
English: This study develops a method combining pitch and deep learning features to analyze emotions in psychological hotline calls, achieving high accuracy in emotion classification and identifying emotional fluctuation patterns as potential indicators for suicide risk assessment and prevention strategies.

Authors:Qian Wang, Jiaying Wu, Zhenheng Tang, Bingqiao Luo, Nuo Chen, Wei Chen, Bingsheng He
Title: What Limits LLM-based Human Simulation: LLMs or Our Design?
Abstract:
We argue that advancing LLM-based human simulation requires addressing both LLMs' inherent limitations and simulation framework design challenges. Recent studies have revealed significant gaps between LLM-based human simulations and real-world observations, highlighting these dual challenges. To address these gaps, we present a comprehensive analysis of LLM limitations and our design issues, proposing targeted solutions for both aspects. Furthermore, we explore future directions that address both challenges simultaneously, particularly in data collection, LLM generation, and evaluation. To support further research in this field, we provide a curated collection of LLM-based human simulation resources: https://github.com/Persdre/llm-human-simulation.
中文: 推进基于大语言模型的人类模拟需同时解决模型固有局限与框架设计难题,通过针对性方案和未来研究方向缩小其与真实观察间的差距。
English: Advancing LLM-based human simulation requires addressing both inherent model limitations and framework design challenges, with proposed solutions and future directions to bridge gaps with real-world observations.

Authors:Kewei Li, Yanwen Kong, Yiping Xu, Jianlin Su, Lan Huang, Ruochi Zhang, Fengfeng Zhou
Title: Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms
Abstract:
Since the emergence of research on improving the length extrapolation capabilities of large language models in 2021, some studies have made modifications to the scaling factor in the scaled dot-product attention mechanism as part of their proposed methods without rigorous theoretical justifications. To fill this gap, we propose two new scaled temperatures based on information entropy invariance to enhance length extrapolation. First, a training-free method InfoScale is designed for dot-product attention, and preserves focus on original tokens during length extrapolation by ensuring consistent entropy. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experimental data demonstrates that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates the Windowed Attention, and highlights the significance of attention score dilution as a key challenge in long-range context handling. The code and data are available at https://github.com/HT-NEKO/Information-Entropy-Invariance.
中文: 本研究基于信息熵不变性提出了InfoScale和CosScale两种新缩放方法,显著增强了语言模型的长度外推能力,通过将上下文窗口扩展至训练长度的64倍实现了最优性能。
English: This study introduces InfoScale and CosScale, two novel scaling methods based on information entropy invariance, which significantly enhance length extrapolation in language models and achieve state-of-the-art performance by extending context windows up to 64 times the training length.
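One common entropy-invariance argument leads to an attention temperature that grows with the log of the sequence length, so that attention entropy stays roughly constant beyond the training length. The sketch below illustrates that general idea only; it is not necessarily the paper's exact InfoScale or CosScale formula.

```python
# Entropy-invariance-motivated attention scaling: the logit scale grows with
# log(n) relative to the training length (an illustrative form, not the
# paper's exact InfoScale).
import math
import torch
import torch.nn.functional as F

def scaled_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     train_len: int = 2048) -> torch.Tensor:
    # q, k, v: [batch, heads, seq_len, head_dim]
    n, d = q.shape[-2], q.shape[-1]
    scale = max(math.log(n) / math.log(train_len), 1.0) / math.sqrt(d)
    attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v
```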

Authors:MiniMax, Aonian Li, Bangwei Gong, Bo Yang, Boji Shan, Chang Liu, Cheng Zhu, Chunhao Zhang, Congchao Guo, Da Chen, Dong Li, Enwei Jiao, Gengxin Li, Guojun Zhang, Haohai Sun, Houze Dong, Jiadai Zhu, Jiaqi Zhuang, Jiayuan Song, Jin Zhu, Jingtao Han, Jingyang Li, Junbin Xie, Junhao Xu, Junjie Yan, Kaishun Zhang, Kecheng Xiao, Kexi Kang, Le Han, Leyang Wang, Lianfei Yu, Liheng Feng, Lin Zheng, Linbo Chai, Long Xing, Meizhi Ju, Mingyuan Chi, Mozhi Zhang, Peikai Huang, Pengcheng Niu, Pengfei Li, Pengyu Zhao, Qi Yang, Qidi Xu, Qiexiang Wang, Qin Wang, Qiuhui Li, Ruitao Leng, Shengmin Shi, Shuqi Yu, Sichen Li, Songquan Zhu, Tao Huang, Tianrun Liang, Weigao Sun, Weixuan Sun, Weiyu Cheng, Wenkai Li, Xiangjun Song, Xiao Su, Xiaodong Han, Xinjie Zhang, Xinzhu Hou, Xu Min, Xun Zou, Xuyang Shen, Yan Gong, Yingjie Zhu, Yipeng Zhou, Yiran Zhong, Yongyi Hu, Yuanxiang Fan, Yue Yu, Yufeng Yang, Yuhao Li, Yunan Huang, Yunji Li, Yunpeng Huang, Yunzhi Xu, Yuxin Mao, Zehan Li, Zekang Li, Zewei Tao, Zewen Ying, Zhaoyang Cong, Zhen Qin, Zhenhua Fan, Zhihang Yu, Zhuo Jiang, Zijia Wu
Title: MiniMax-01: Scaling Foundation Models with Lightning Attention
Abstract:
We introduce MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01 is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window. We publicly release MiniMax-01 at https://github.com/MiniMax-AI.
中文: MiniMax-01系列模型融合闪电注意力与专家混合技术,在保持顶尖性能的同时突破上下文长度限制,以更低成本实现百万级 token 的高效处理。
English: The MiniMax-01 series combines lightning attention with a Mixture-of-Experts architecture, matching top-tier models while extending the context window to millions of tokens at significantly lower training and inference cost.
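Lightning attention belongs to the linear-attention family, which is what makes million-token contexts affordable. The sketch below shows only the underlying kernelized attention in its simplest non-causal form; the IO-aware tiling and causal block decomposition that make lightning attention fast on real hardware are omitted.

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Minimal (non-causal) linear attention in O(n * d^2).

    Keys and values are aggregated once, so cost grows linearly with
    sequence length instead of quadratically. The ReLU feature map is
    one simple positive kernel, assumed here for illustration.
    """
    phi = lambda x: np.maximum(x, 0.0) + eps
    q, k = phi(q), phi(k)
    kv = k.T @ v                  # (d, d_v): summarize all keys/values
    z = q @ k.sum(axis=0)         # per-query normalizer
    return (q @ kv) / z[:, None]

rng = np.random.default_rng(0)
q = rng.normal(size=(1024, 32))
k = rng.normal(size=(1024, 32))
v = rng.normal(size=(1024, 64))
print(linear_attention(q, k, v).shape)  # (1024, 64)
```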

Authors:Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, David Ifeoluwa Adelani, Ibrahim Said Ahmad, Saminu Mohammad Aliyu, Nelson Odhiambo Onyango, Lilian D. A. Wanzare, Samuel Rutunda, Lukman Jibril Aliyu, Esubalew Alemneh, Oumaima Hourrane, Hagos Tesfahun Gebremichael, Elyas Abdi Ismail, Meriem Beloucif, Ebrahim Chekol Jibril, Andiswa Bukula, Rooweither Mabuya, Salomey Osei, Abigail Oppong, Tadesse Destaw Belay, Tadesse Kebede Guge, Tesfa Tegegne Asfaw, Chiamaka Ijeoma Chukwuneke, Paul Röttger, Seid Muhie Yimam, Nedjma Ousidhoum
Title: AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages
Abstract:
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available on https://github.com/AfriHate/AfriHate
中文摘要:全球南方地区因缺乏本地语言数据和忽视文化背景而存在仇恨言论审核不足的问题,AfriHate数据集通过提供15种非洲语言的原生标注内容,有效提升了仇恨言论的识别与分类能力。
English Summary: Hate speech moderation in the Global South faces challenges from inadequate data and cultural context, which the AfriHate dataset addresses by providing native-annotated content in 15 African languages to improve detection and classification.

Authors:Jinjun Peng, Leyi Cui, Kele Huang, Junfeng Yang, Baishakhi Ray
Title: CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation
Abstract:
Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. This poses considerable security risks in using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to address this but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This framework assesses code functionality and security simultaneously, using high-quality task specifications and outcome-driven test oracles that provide high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation of LLM-generated code, overcoming previous benchmarks' shortcomings. Through our evaluations, CWEval reveals that a notable portion of LLM-produced code is functional but insecure, exposes serious inaccuracies in previous evaluations, and thereby contributes significantly to the field of secure code generation. We open-source our artifact at https://github.com/Co1lin/CWEval.
Chinese: CWEval 是一个结果驱动的评估框架,通过高质量任务规范和结果驱动的测试机制,同时评估代码功能性和安全性,有效克服了以往基准测试的不足,显著提升了LLM生成代码的安全评估水平。
English: CWEval is an outcome-driven framework that enhances secure code generation evaluation by simultaneously assessing both functionality and security, overcoming limitations of previous benchmarks through high-quality specifications and multilingual testing.
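The following toy illustrates what "outcome-driven" means here: a candidate function passes only if both a functional oracle and a security oracle (an actual exploit attempt) succeed. The task, a path-traversal-safe file reader, is a hypothetical example, not one of CWEval-bench's tasks.

```python
import os
import tempfile
from pathlib import Path

def secure_read(base_dir: str, rel_path: str) -> str:
    """Stand-in for LLM-generated code under test: read a file inside base_dir."""
    full = os.path.realpath(os.path.join(base_dir, rel_path))
    if not full.startswith(os.path.realpath(base_dir) + os.sep):
        raise ValueError("path escapes base directory")
    with open(full) as f:
        return f.read()

def outcome_driven_eval(fn, base_dir):
    """Pass only if the code is BOTH functionally correct and secure,
    judged purely by observable outcomes rather than code patterns."""
    functional = fn(base_dir, "data.txt") == "hello"   # functionality oracle
    try:
        fn(base_dir, "../secret.txt")                  # concrete exploit attempt
        secure = False                                 # the attack succeeded
    except (ValueError, OSError):
        secure = True                                  # the attack was rejected
    return functional, secure

root = Path(tempfile.mkdtemp())
(root / "base").mkdir()
(root / "base" / "data.txt").write_text("hello")
(root / "secret.txt").write_text("leaked")
print(outcome_driven_eval(secure_read, str(root / "base")))  # (True, True)
```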

Authors:Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei
Title: OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training
Abstract:
Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.
中文: OpenCSG中文语料库通过提供多样化的高质量数据集,有效解决了中文大语言模型训练数据匮乏的问题,显著提升了模型在C-Eval等任务中的表现。
English: The OpenCSG Chinese Corpus addresses the scarcity of high-quality Chinese datasets for LLMs by providing diverse, scalable datasets that significantly enhance model performance in tasks like C-Eval.

Authors:Yin Fang, Xinle Deng, Kangwei Liu, Ningyu Zhang, Jingyang Qian, Penghui Yang, Xiaohui Fan, Huajun Chen
Title: A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following
Abstract:
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. In the life sciences, single-cell RNA sequencing (scRNA-seq) data serves as the "language of cellular biology", capturing intricate gene expression patterns at the single-cell level. However, interacting with this "language" through conventional tools is often inefficient and unintuitive, posing challenges for researchers. To address these limitations, we present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. We construct a comprehensive multi-modal instruction dataset that pairs text-based instructions with scRNA-seq profiles from diverse tissues and species. Building on this, we develop a multi-modal cell language architecture capable of simultaneously interpreting and processing both modalities. InstructCell empowers researchers to accomplish critical tasks, such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction, using straightforward natural language commands. Extensive evaluations demonstrate that InstructCell consistently meets or exceeds the performance of existing single-cell foundation models, while adapting to diverse experimental conditions. More importantly, InstructCell provides an accessible and intuitive tool for exploring complex single-cell data, lowering technical barriers and enabling deeper biological insights.
中文: InstructCell作为多模态AI助手,通过自然语言实现对单细胞RNA测序数据的直观灵活分析,在细胞注释和药物预测等任务中优于现有模型,同时显著降低了复杂生物数据的技术门槛。
English: InstructCell is a multimodal AI copilot that uses natural language to enable intuitive and flexible analysis of single-cell RNA sequencing data, outperforming existing models in tasks like cell annotation and drug prediction while making complex biological data more accessible.

Authors:Yaowen Ye, Cassidy Laidlaw, Jacob Steinhardt
Title: Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision
Abstract:
Language model (LM) post-training relies on two stages of human supervision: task demonstrations for supervised finetuning (SFT), followed by preference comparisons for reinforcement learning from human feedback (RLHF). As LMs become more capable, the tasks they are given become harder to supervise. Will post-training remain effective under unreliable supervision? To test this, we simulate unreliable demonstrations and comparison feedback using small LMs and time-constrained humans. We find that in the presence of unreliable supervision, SFT still retains some effectiveness, but DPO (a common RLHF algorithm) fails to improve the model beyond SFT. To address this, we propose iterative label refinement (ILR) as an alternative to RLHF. ILR improves the SFT data by using comparison feedback to decide whether human demonstrations should be replaced by model-generated alternatives, then retrains the model via SFT on the updated data. SFT+ILR outperforms SFT+DPO on several tasks with unreliable supervision (math, coding, and safe instruction-following). Our findings suggest that as LMs are used for complex tasks where human supervision is unreliable, RLHF may no longer be the best use of human comparison feedback; instead, it is better to direct feedback towards improving the training data rather than continually training the model. Our code and data are available at https://github.com/helloelwin/iterative-label-refinement.
中文: 当人类监督不可靠时,监督微调(SFT)仍保持部分有效性,但直接偏好优化(DPO)无法进一步优化模型,因此提出迭代标签优化(ILR)作为更优替代方案,通过提升训练数据质量而非持续训练模型来改善性能。
English: When human supervision becomes unreliable, supervised fine-tuning (SFT) remains partially effective, but direct preference optimization (DPO) fails to enhance the model further, prompting the introduction of iterative label refinement (ILR) as a superior alternative that improves training data quality instead of continuous model training.
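A minimal sketch of the ILR loop follows. The training, generation, and comparison callables are placeholders supplied by the caller, and the toy demo at the bottom only exercises the data-refinement mechanics, not real SFT.

```python
import statistics

def iterative_label_refinement(dataset, train_sft, generate, prefer, rounds=3):
    """Minimal ILR loop (a sketch, not the paper's exact procedure).

    dataset:   list of (prompt, demonstration) pairs, possibly unreliable
    train_sft: fn(dataset) -> model
    generate:  fn(model, prompt) -> candidate response
    prefer:    fn(prompt, a, b) -> True if a is preferred over b
               (the comparison feedback, assumed noisy but usable)
    """
    model = train_sft(dataset)
    for _ in range(rounds):
        refined = []
        for prompt, demo in dataset:
            candidate = generate(model, prompt)
            # Direct feedback at the *data*: replace the demonstration
            # only when the model's proposal is preferred over it.
            refined.append((prompt, candidate if prefer(prompt, candidate, demo) else demo))
        dataset = refined
        model = train_sft(dataset)  # plain SFT retraining on refined data
    return model, dataset

# Toy demo: the "true" label of a prompt is its length; demos are noisy.
data = [("aaa", 2.5), ("bbbb", 4.4), ("cc", 1.8)]
train = lambda ds: statistics.mean(label for _, label in ds)
gen = lambda model, p: len(p)                       # stand-in generator
pref = lambda p, a, b: abs(a - len(p)) < abs(b - len(p))
model, refined = iterative_label_refinement(data, train, gen, pref, rounds=2)
print(refined)  # labels converge to the reliable values 3, 4, 2
```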

Authors:Zhaokai Wang, Xizhou Zhu, Xue Yang, Gen Luo, Hao Li, Changyao Tian, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai
Title: Parameter-Inverted Image Pyramid Networks for Visual Perception and Multimodal Understanding
Abstract:
Image pyramids are widely adopted in top-performing methods to obtain multi-scale features for precise visual perception and understanding. However, current image pyramids use the same large-scale model to process multiple resolutions of images, leading to significant computational cost. To address this challenge, we propose a novel network architecture, called Parameter-Inverted Image Pyramid Networks (PIIP). Specifically, PIIP uses pretrained models (ViTs or CNNs) as branches to process multi-scale images, where images of higher resolutions are processed by smaller network branches to balance computational cost and performance. To integrate information from different spatial scales, we further propose a novel cross-branch feature interaction mechanism. To validate PIIP, we apply it to various perception models and a representative multimodal large language model called LLaVA, and conduct extensive experiments on various tasks such as object detection, segmentation, image classification and multimodal understanding. PIIP achieves superior performance compared to single-branch and existing multi-resolution approaches with lower computational cost. When applied to InternViT-6B, a large-scale vision foundation model, PIIP can improve its performance by 1%-2% on detection and segmentation with only 40%-60% of the original computation, finally achieving 60.0 box AP on MS COCO and 59.7 mIoU on ADE20K. For multimodal understanding, our PIIP-LLaVA achieves 73.0% accuracy on TextVQA and 74.5% on MMBench with only 2.8M training data. Our code is released at https://github.com/OpenGVLab/PIIP.
中文总结:提出的参数倒置图像金字塔网络通过使用更小的网络分支处理高分辨率图像来降低计算成本,同时利用跨分支特征交互保持性能,在多项视觉任务中以显著减少的计算量实现了更优表现。
English Summary: The proposed Parameter-Inverted Image Pyramid (PIIP) network processes higher-resolution images with smaller network branches to reduce computational costs while maintaining performance through cross-branch feature interactions, achieving superior results across various vision tasks with significantly lower computation.
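The parameter-inversion idea can be shown in a few lines of PyTorch: route the full-resolution image through the smaller branch and the downsampled image through the larger one, then fuse. The toy module below uses plain convolutions and a 1x1-conv fusion as stand-ins for the paper's pretrained ViT/CNN branches and cross-branch attention.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPIIP(nn.Module):
    """Toy parameter-inverted two-branch pyramid (illustrative only)."""
    def __init__(self, c_small=32, c_large=128):
        super().__init__()
        # Small branch (few channels) will see the HIGH-resolution input.
        self.small = nn.Sequential(nn.Conv2d(3, c_small, 3, 2, 1), nn.ReLU(),
                                   nn.Conv2d(c_small, c_small, 3, 2, 1))
        # Large branch (many channels) will see the LOW-resolution input.
        self.large = nn.Sequential(nn.Conv2d(3, c_large, 3, 2, 1), nn.ReLU(),
                                   nn.Conv2d(c_large, c_large, 3, 2, 1))
        self.fuse = nn.Conv2d(c_small + c_large, c_large, 1)  # cheap interaction

    def forward(self, img):
        hi = img                                    # full resolution -> small branch
        lo = F.interpolate(img, scale_factor=0.5)   # half resolution -> large branch
        f_hi, f_lo = self.small(hi), self.large(lo)
        f_lo = F.interpolate(f_lo, size=f_hi.shape[-2:])  # align spatial sizes
        return self.fuse(torch.cat([f_hi, f_lo], dim=1))

x = torch.randn(1, 3, 224, 224)
print(TinyPIIP()(x).shape)  # torch.Size([1, 128, 56, 56])
```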

Authors:Varun Biyyala, Bharat Chanderprakash Kathuria, Jialu Li, Youshan Zhang
Title: SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
Abstract:
Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with Object Detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on Semantic, Spatial, and Temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available in the GitHub repository: https://github.com/custommetrics-sst/SST_CustomEvaluationMetrics.git
中文摘要:SST-EM是一种创新的视频编辑评估框架,通过整合语义提取、目标追踪、精细化处理和时序一致性检测,全面评估视频的语义保真度与时间连贯性。
English Summary: SST-EM is a novel video editing evaluation framework that integrates semantic extraction, object tracking, refinement, and temporal consistency checks to comprehensively assess semantic fidelity and temporal smoothness.
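Since the final SST-EM score is a human-calibrated weighted sum of its four component scores, the aggregation step is easy to sketch. The least-squares fit below mirrors the regression-based weighting; all numbers are made up, not the published weights.

```python
import numpy as np
from numpy.linalg import lstsq

def fit_metric_weights(component_scores, human_scores):
    """Fit per-component weights against human ratings by least squares,
    in the spirit of SST-EM's regression-derived weighting."""
    X = np.asarray(component_scores, dtype=float)  # (n_videos, 4 components)
    y = np.asarray(human_scores, dtype=float)
    w, *_ = lstsq(X, y, rcond=None)
    return w

def sst_em_score(components, w):
    return float(np.dot(components, w))

# Toy data: columns = semantic, object-tracking, refinement, temporal scores.
X = [[0.9, 0.8, 0.7, 0.9], [0.4, 0.5, 0.6, 0.3], [0.7, 0.7, 0.8, 0.6]]
y = [0.85, 0.40, 0.70]          # human ratings for the same videos
w = fit_metric_weights(X, y)
print(round(sst_em_score([0.8, 0.7, 0.7, 0.8], w), 3))
```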

Authors:Yongyu Mu, Hengyu Li, Junxin Wang, Xiaoxuan Zhou, Chenglong Wang, Yingfeng Luo, Qiaozhi He, Tong Xiao, Guocheng Chen, Jingbo Zhu
Title: Boosting Text-To-Image Generation via Multilingual Prompting in Large Multimodal Models
Abstract:
Previous work on augmenting large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL). This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical. However, as demand for more complex and flexible image descriptions grows, enhancing comprehension of input text within the ICL paradigm remains a critical yet underexplored area. In this work, we extend this line of research by constructing parallel multilingual prompts aimed at harnessing the multilingual capabilities of LMMs. More specifically, we translate the input text into several languages and provide the models with both the original text and the translations. Experiments on two LMMs across 3 benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained assessments, especially in human preference alignment. Additionally, with its advantage of generating more diverse images, PMT2I significantly outperforms baseline prompts when incorporated with reranking methods. Our code and parallel multilingual data can be found at https://github.com/takagi97/PMT2I.
Chinese: 本研究提出PMT2I方法,通过为大型多模态模型提供平行多语言提示来增强文本到图像的生成,在通用、组合和细粒度评估中表现优异,并能生成更多样化的图像。
English: This study introduces PMT2I, a method that enhances text-to-image generation by providing parallel multilingual prompts to large multimodal models, achieving superior performance in general, compositional, and fine-grained assessments while generating more diverse images.
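The core prompt construction is simple to sketch. The template below is an assumed format (the paper's exact template may differ), and the translations are expected to come from an external MT system beforehand.

```python
def build_pmt2i_prompt(text, translations):
    """Assemble a parallel multilingual T2I prompt (hypothetical template).

    `translations` maps a language name to a pre-computed translation of
    `text`; the original and all translations are presented together so
    the LMM can cross-reference them.
    """
    lines = [f"English: {text}"]
    lines += [f"{lang}: {t}" for lang, t in translations.items()]
    lines.append("Generate an image that matches the description above.")
    return "\n".join(lines)

print(build_pmt2i_prompt(
    "a red bicycle leaning against a stone wall",
    {"German": "ein rotes Fahrrad lehnt an einer Steinmauer",
     "French": "un vélo rouge appuyé contre un mur de pierre"}))
```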

Authors:Jianming Tong, Tianhao Huang, Leo de Castro, Anirudh Itagi, Jingtian Dang, Anupam Golder, Asra Ali, Jevin Jiang, Arvind, G. Edward Suh, Tushar Krishna
Title: Leveraging ASIC AI Chips for Homomorphic Encryption
Abstract:
Cloud-based services are making the outsourcing of sensitive client data increasingly common. Although homomorphic encryption (HE) offers strong privacy guarantees, it requires substantially more resources than computing on plaintext, often leading to unacceptably large latencies in getting the results. HE accelerators have emerged to mitigate this latency issue, but with the high cost of ASICs. In this paper we show that HE primitives can be converted to AI operators and accelerated on existing ASIC AI accelerators, like TPUs, which are already widely deployed in the cloud. Adapting such accelerators for HE requires (1) supporting modular multiplication, (2) high-precision arithmetic in software, and (3) efficient mapping on matrix engines. We introduce the CROSS compiler, which (1) adopts Barrett reduction to provide modular reduction support using multipliers and adders, (2) uses Basis Aligned Transformation (BAT) to convert high-precision multiplication into low-precision matrix-vector multiplication, and (3) uses Matrix Aligned Transformation (MAT) to convert vectorized modular operations with reduction into matrix multiplications that can be processed efficiently on a 2D spatial matrix engine. Our evaluation of CROSS on a Google TPUv4 demonstrates significant performance improvements, with up to 161x and 5x speedups compared to previous work on many-core CPUs and the V100, respectively. The kernel-level codes are open-sourced at https://github.com/google/jaxite/tree/main/jaxite_word.
中文: 同态加密可通过将其原语转换为AI算子并在现有AI加速器(如TPU)上运行来加速,通过支持模乘运算和矩阵变换等技术实现了显著的性能提升。
English: Homomorphic encryption can be accelerated by converting its primitives into AI operators and running them on existing AI accelerators like TPUs, achieving significant speedups through techniques such as modular multiplication support and matrix transformations.
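Barrett reduction is the piece that makes modular arithmetic expressible with only multiplies, shifts, and adds, which is exactly what matrix engines provide. A minimal integer sketch:

```python
def barrett_reduce(x, q, k=None):
    """Compute x mod q without division at reduction time.

    With m = floor(2^k / q) precomputed once, the quotient estimate
    floor(x*m / 2^k) is off by at most one for x < 2^k, so a single
    conditional subtraction finishes the reduction.
    """
    if k is None:
        k = 2 * q.bit_length()     # large enough for x up to ~q^2
    m = (1 << k) // q              # precomputed per modulus
    t = x - ((x * m) >> k) * q     # estimate of x mod q, in [0, 2q)
    return t - q if t >= q else t

q = 65537
for x in [0, 1, q - 1, q, 123456789, q * q - 1]:
    assert barrett_reduce(x, q) == x % q
print("ok")
```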

Authors:Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Title: A General Framework for Inference-time Scaling and Steering of Diffusion Models
Abstract:
Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we present Feynman-Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering a 0.8B parameter model outperforms a 2.6B parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models - even with off-the-shelf rewards - can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk-Diffusion-Steering .
中文: 本文提出Feynman-Kac引导方法,这是一种无需训练即可通过奖励函数在推理时指导扩散模型的框架,在文本到图像生成和文本扩散任务中,其效果优于经过微调的模型,显著提升了样本质量和可控性。
English: This paper introduces Feynman-Kac (FK) steering, an inference-time framework that guides diffusion models using reward functions to enhance sample quality and controllability without requiring training, achieving superior results in text-to-image generation and text diffusion tasks compared to fine-tuned models.
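At its core, FK steering is a sequential Monte Carlo resampling step over diffusion particles. The sketch below implements one such step with an exponential-in-reward potential, one simple choice among those the paper explores; the diffusion sampler and the reward model are abstracted away.

```python
import numpy as np

def fk_resample(particles, rewards, temperature=1.0, rng=None):
    """One FK-steering resampling step over intermediate diffusion states.

    The potential here is exp(reward / temperature): particles with high
    intermediate reward are duplicated, low-reward ones are dropped,
    steering the population toward high-reward samples.
    """
    rng = rng or np.random.default_rng()
    logits = np.asarray(rewards, dtype=float) / temperature
    w = np.exp(logits - logits.max())      # stable softmax weights
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

# Toy demo: after resampling, high-reward "states" dominate the population.
x = np.arange(8.0)                          # stand-in intermediate states
print(fk_resample(x, rewards=x, temperature=0.5, rng=np.random.default_rng(0)))
```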

Authors:Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu
Title: SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training
Abstract:
Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to 1000× larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B parameters, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git
中文: 本文提出SPAM优化器,通过动量重置和梯度裁剪有效缓解大语言模型训练中的梯度尖峰问题,在多种任务中提升训练稳定性与资源效率,性能优于现有方法。
English: This paper introduces SPAM, a novel optimizer that mitigates gradient spikes in LLM training through momentum reset and adaptive clipping, improving stability and efficiency across various tasks and outperforming existing methods.
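A reduced single-tensor version of the two mechanisms is sketched below: gradient entries far above their running second-moment estimate are clipped down, and both moments are reset on a fixed schedule. The constants are illustrative and full Adam's bias correction is omitted.

```python
import numpy as np

class TinySPAM:
    """Reduced SPAM-style optimizer for a single parameter tensor.

    (1) Spike-aware clipping: entries whose squared gradient exceeds
    `spike_thresh` times the running second moment are scaled down to
    that level. (2) Momentum reset: moments are zeroed every
    `reset_every` steps. Constants are illustrative, not the paper's.
    """
    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999,
                 spike_thresh=50.0, reset_every=500, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.thresh, self.reset_every = spike_thresh, reset_every
        self.m = self.v = None
        self.t = 0

    def step(self, w, g):
        if self.m is None:
            self.m, self.v = np.zeros_like(w), np.zeros_like(w)
        self.t += 1
        if self.t % self.reset_every == 0:
            self.m[:], self.v[:] = 0.0, 0.0          # momentum reset
        spike = (self.v > 0) & (g * g > self.thresh * self.v)
        g = np.where(spike, np.sign(g) * np.sqrt(self.thresh * self.v), g)
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        return w - self.lr * self.m / (np.sqrt(self.v) + self.eps)

w = np.array([1.0, -2.0])
opt = TinySPAM()
for _ in range(200):
    w = opt.step(w, 2 * w)     # gradient of ||w||^2
print(np.round(w, 3))          # moves toward the origin
```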

Authors:Mahmoud Ahmed, Xiang Li, Arpit Prajapati, Mohamed Elhoseiny
Title: 3DCoMPaT200: Language-Grounded Compositional Understanding of Parts and Materials of 3D Shapes
Abstract:
Understanding objects in 3D at the part level is essential for humans and robots to navigate and interact with the environment. Current datasets for part-level 3D object understanding encompass a limited range of categories. For instance, the ShapeNet-Part and PartNet datasets only include 16 and 24 object categories, respectively. The 3DCoMPaT dataset, specifically designed for compositional understanding of parts and materials, contains only 42 object categories. To foster richer and fine-grained part-level 3D understanding, we introduce 3DCoMPaT200, a large-scale dataset tailored for compositional understanding of object parts and materials, with 200 object categories, an object vocabulary approximately 5 times larger than 3DCoMPaT's, and approximately 4 times more part categories. Concretely, 3DCoMPaT200 significantly expands upon 3DCoMPaT, featuring 1,031 fine-grained part categories and 293 distinct material classes for compositional application to 3D object parts. Additionally, to address the complexities of compositional 3D modeling, we propose a novel task of Compositional Part Shape Retrieval using ULIP to provide a strong 3D foundational model for 3D compositional understanding. This method evaluates the model's shape retrieval performance given one, three, or six parts described in text format. The results show that the model's performance improves with an increasing number of style compositions, highlighting the critical role of the compositional dataset. Such results underscore the dataset's effectiveness in enhancing models' capability to understand complex 3D shapes from a compositional perspective. Code and data can be found at http://github.com/3DCoMPaT200/3DCoMPaT200
中文: 为推进部件级三维物体理解,研究者提出了大规模数据集3DCoMPaT200,涵盖200个物体类别并显著扩展了部件与材料分类,同时通过ULIP框架创新性地引入组合式部件检索任务,证明模型性能随组合复杂度提升而增强。
English: To advance part-level 3D object understanding, the authors introduce 3DCoMPaT200, a large-scale dataset with 200 object categories, significantly expanding part and material classes, and propose a compositional part shape retrieval task using ULIP to enhance model performance with increasing compositional complexity.

Authors:Veronika Smilga
Title: Scaling Down Semantic Leakage: Investigating Associative Bias in Smaller Language Models
Abstract:
Semantic leakage is a phenomenon recently introduced by Gonen et al. (2024). It refers to a situation in which associations learnt from the training data emerge in language model generations in an unexpected and sometimes undesired way. Prior work has focused on leakage in large language models (7B+ parameters). In this study, I use Qwen2.5 model family to explore whether smaller models, ranging from 500M to 7B parameters, demonstrate less semantic leakage due to their limited capacity for capturing complex associations. Building on the previous dataset from Gonen et al. (2024), I introduce a new dataset of color-focused prompts, categorized into specific types of semantic associations, to systematically evaluate the models' performance. Results indicate that smaller models exhibit less semantic leakage overall, although this trend is not strictly linear, with medium-sized models sometimes surpassing larger ones in leaking behavior. The dataset, the model generations, and the evaluation code are publicly available at https://github.com/smilni/semantic_leakage_project.
中文: 本研究探讨了较小规模Qwen2.5模型(5亿至70亿参数)中的语义泄露现象,发现模型容量减小通常与较少泄露相关,但这种关系并非线性,中等规模模型有时反而表现出比大模型更明显的泄露行为。
English: This study investigates semantic leakage in smaller Qwen2.5 models (500M-7B parameters), revealing that reduced model capacity generally correlates with less leakage, though the relationship is non-linear and medium-sized models occasionally exhibit more leakage than larger ones.

Authors:Xiangru Tang, Tianyu Hu, Muyang Ye, Yanjun Shao, Xunjian Yin, Siru Ouyang, Wangchunshu Zhou, Pan Lu, Zhuosheng Zhang, Yilun Zhao, Arman Cohan, Mark Gerstein
Title: ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning
Abstract:
Chemical reasoning usually involves complex, multi-step processes that demand precise calculations, where even minor errors can lead to cascading failures. Furthermore, large language models (LLMs) encounter difficulties handling domain-specific formulas, executing reasoning steps accurately, and integrating code effectively when tackling chemical reasoning tasks. To address these challenges, we present ChemAgent, a novel framework designed to improve the performance of LLMs through a dynamic, self-updating library. This library is developed by decomposing chemical tasks into sub-tasks and compiling these sub-tasks into a structured collection that can be referenced for future queries. Then, when presented with a new problem, ChemAgent retrieves and refines pertinent information from the library, which we call memory, facilitating effective task decomposition and the generation of solutions. Our method designs three types of memory and a library-enhanced reasoning component, enabling LLMs to improve over time through experience. Experimental results on four chemical reasoning datasets from SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (GPT-4), significantly outperforming existing methods. Our findings suggest substantial potential for future applications, including tasks such as drug discovery and materials science. Our code can be found at https://github.com/gersteinlab/chemagent
Chinese: ChemAgent是一种创新框架,通过动态自更新的任务分解库和记忆检索机制提升大语言模型在化学推理中的表现,在多个数据集上实现高达46%的性能提升,展现出在药物发现等领域的应用潜力。
English: ChemAgent is a novel framework that enhances large language models' chemical reasoning by using a dynamic, self-updating library for task decomposition and memory retrieval, achieving up to 46% performance gains on datasets and showing promise for applications like drug discovery.
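The library mechanism can be sketched independently of the chemistry: store solved sub-tasks, retrieve the most similar ones for a new query, and grow with experience. String similarity below is a stand-in for the embedding-based retrieval and the three memory types of the actual system.

```python
from difflib import SequenceMatcher

class TaskLibrary:
    """Toy self-updating library in ChemAgent's spirit.

    Entries are (sub-task description, solution) pairs; retrieval ranks
    them by textual similarity to the new problem. A real system would
    use LLM embeddings and decompose problems into sub-tasks first.
    """
    def __init__(self):
        self.entries = []

    def add(self, description, solution):
        """Grow the library after a sub-task is solved and verified."""
        self.entries.append((description, solution))

    def retrieve(self, query, k=2):
        score = lambda d: SequenceMatcher(None, query.lower(), d.lower()).ratio()
        return sorted(self.entries, key=lambda e: score(e[0]), reverse=True)[:k]

lib = TaskLibrary()
lib.add("convert grams to moles given molar mass", "n = m / M")
lib.add("ideal gas law for pressure", "p = nRT / V")
lib.add("dilution of a solution", "c1*V1 = c2*V2")
print(lib.retrieve("how many moles in 18 g of water (M = 18 g/mol)?"))
```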

Authors:Tushar Aggarwal, Aarohi Bhand
Title: PASS: Presentation Automation for Slide Generation and Speech
Abstract:
In today's fast-paced world, effective presentations have become an essential tool for communication in both online and offline meetings. The crafting of a compelling presentation requires significant time and effort, from gathering key insights to designing slides that convey information clearly and concisely. However, despite the wealth of resources available, people often find themselves manually extracting crucial points, analyzing data, and organizing content in a way that ensures clarity and impact. Furthermore, a successful presentation goes beyond just the slides; it demands rehearsal and the ability to weave a captivating narrative to fully engage the audience. Although there has been some exploration of automating document-to-slide generation, existing research is largely centered on converting research papers. In addition, automation of the delivery of these presentations has yet to be addressed. We introduce PASS, a pipeline used to generate slides from general Word documents, going beyond just research papers, which also automates the oral delivery of the generated slides. PASS analyzes user documents to create a dynamic, engaging presentation with an AI-generated voice. Additionally, we developed an LLM-based evaluation metric to assess our pipeline across three critical dimensions of presentations: relevance, coherence, and redundancy. The data and codes are available at https://github.com/AggarwalTushar/PASS.
中文摘要:PASS是一种创新流程,能从通用Word文档自动生成演示文稿并实现AI语音播报,同时采用基于大语言模型的评估指标来衡量内容的相关性、连贯性和冗余度。
English Summary: PASS is an innovative pipeline that automates the creation and delivery of presentations from general Word documents, utilizing AI-generated voice and an LLM-based metric to evaluate relevance, coherence, and redundancy.

Authors:Rui Liu, Zhenqi Jia, Feilong Bao, Haizhou Li
Title: Retrieval-Augmented Dialogue Knowledge Aggregation for Expressive Conversational Speech Synthesis
Abstract:
Conversational speech synthesis (CSS) aims to take the current dialogue (CD) history as a reference to synthesize expressive speech that aligns with the conversational style. Unlike CD, stored dialogue (SD) contains preserved dialogue fragments from earlier stages of user-agent interaction, which include style expression knowledge relevant to scenarios similar to those in CD. Note that this knowledge plays a significant role in enabling the agent to synthesize expressive conversational speech that generates empathetic feedback. However, prior research has overlooked this aspect. To address this issue, we propose a novel Retrieval-Augmented Dialogue Knowledge Aggregation scheme for expressive CSS, termed RADKA-CSS, which includes three main components: 1) to effectively retrieve dialogues from SD that are similar to CD in terms of both semantics and style, we first build a stored dialogue semantic-style database (SDSSD) that includes text and audio samples, then design a multi-attribute retrieval scheme to match the dialogue semantic and style vectors of the CD with those stored in the SDSSD, retrieving the most similar dialogues; 2) to effectively utilize the style knowledge from CD and SD, we adopt a multi-granularity graph structure to encode the dialogue and introduce a multi-source style knowledge aggregation mechanism; 3) finally, the aggregated style knowledge is fed into the speech synthesizer to help the agent synthesize expressive speech that aligns with the conversational style. We conducted a comprehensive and in-depth experiment based on the DailyTalk dataset, a benchmarking dataset for the CSS task. Both objective and subjective evaluations demonstrate that RADKA-CSS outperforms baseline models in expressiveness rendering. Code and audio samples can be found at: https://github.com/Coder-jzq/RADKA-CSS.
Chinese: 提出的RADKA-CSS框架通过检索和聚合存储对话中的相关风格知识,显著提升了对话语音合成的表现力,优于现有基准模型。
English: The proposed RADKA-CSS framework enhances conversational speech synthesis by retrieving and aggregating relevant style knowledge from stored dialogues, significantly improving expressiveness over previous methods.

Authors:Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Quanquan Gu, Andrew C Yao
Title: Tensor Product Attention Is All You Need
Abstract:
Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor Product Attention Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA), across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.
中文: 本文提出张量积注意力(TPA),这是一种通过张量分解压缩键值缓存的新机制,能在语言任务中减少内存占用,同时保持或提升模型性能。
English: The paper introduces Tensor Product Attention (TPA), a novel mechanism that compresses key-value caches using tensor decompositions to reduce memory usage while maintaining or improving model performance in language tasks.
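The factorization itself is compact enough to sketch: each token's per-head keys (and likewise values) are built as a sum of R rank-1 tensor products between a heads-axis factor and a dimension-axis factor, so the cache can store the factors instead of full head-by-dimension matrices. The parameterization below is simplified and omits RoPE.

```python
import torch
import torch.nn as nn

class TPAKeyValue(nn.Module):
    """Toy tensor-product factorization of per-token keys (RoPE omitted).

    Per token, the cache would hold rank * (n_heads + d_head) numbers
    (the two factors) instead of n_heads * d_head; shapes follow the
    paper's idea but the exact parameterization is simplified.
    """
    def __init__(self, d_model=256, n_heads=8, d_head=32, rank=2):
        super().__init__()
        self.h, self.dh, self.r = n_heads, d_head, rank
        self.head_proj = nn.Linear(d_model, rank * n_heads)  # a_r(x): heads axis
        self.dim_proj = nn.Linear(d_model, rank * d_head)    # b_r(x): dim axis

    def forward(self, x):                      # x: (batch, seq, d_model)
        b, n, _ = x.shape
        a = self.head_proj(x).view(b, n, self.r, self.h)
        c = self.dim_proj(x).view(b, n, self.r, self.dh)
        # Reconstruct full per-head keys: sum_r a_r ⊗ b_r -> (b, n, h, dh)
        return torch.einsum("bnrh,bnrd->bnhd", a, c) / self.r

x = torch.randn(2, 10, 256)
print(TPAKeyValue()(x).shape)  # torch.Size([2, 10, 8, 32])
```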

Authors:Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qinglin Zhang, Shiliang Zhang, Nan Zhao, Pei Zhang, Chong Zhang, Jinren Zhou
Title: MinMo: A Multimodal Large Language Model for Seamless Voice Interaction
Abstract:
Recent advancements in large language models (LLMs) and multimodal speech-text models have laid the groundwork for seamless voice interactions, enabling real-time, natural, and human-like conversations. Previous models for voice interactions are categorized as native and aligned. Native models integrate speech and text processing in one framework but struggle with issues like differing sequence lengths and insufficient pre-training. Aligned models maintain text LLM capabilities but are often limited by small datasets and a narrow focus on speech tasks. In this work, we introduce MinMo, a Multimodal Large Language Model with approximately 8B parameters for seamless voice interaction. We address the main limitations of prior aligned multimodal models. We train MinMo through multiple stages of speech-to-text alignment, text-to-speech alignment, speech-to-speech alignment, and duplex interaction alignment, on 1.4 million hours of diverse speech data and a broad range of speech tasks. After the multi-stage training, MinMo achieves state-of-the-art performance across various benchmarks for voice comprehension and generation while maintaining the capabilities of text LLMs, and also facilitates full-duplex conversation, that is, simultaneous two-way communication between the user and the system. Moreover, we propose a novel and simple voice decoder that outperforms prior models in voice generation. The enhanced instruction-following capabilities of MinMo support controlling speech generation based on user instructions, with various nuances including emotions, dialects, and speaking rates, as well as mimicking specific voices. For MinMo, the speech-to-text latency is approximately 100ms, and full-duplex latency is approximately 600ms in theory and 800ms in practice. The MinMo project web page is https://funaudiollm.github.io/minmo, and the code and models will be released soon.
中文: 本研究提出的MinMo是一个80亿参数的多模态大语言模型,通过多阶段训练在大量语音数据上实现了语音理解与生成的最优性能,同时支持全双工对话并具备增强的语音控制功能。
English: This work introduces MinMo, an 8-billion-parameter multimodal large language model that achieves state-of-the-art performance in voice comprehension and generation through multi-stage training on extensive speech data, while enabling full-duplex conversations and enhanced voice control capabilities.

Authors:Jing Guo, Nan Li, Ming Xu
Title: Environmental large language model Evaluation (ELLE) dataset: A Benchmark for Evaluating Generative AI applications in Eco-environment Domain
Abstract:
Generative AI holds significant potential for ecological and environmental applications such as monitoring, data analysis, education, and policy support. However, its effectiveness is limited by the lack of a unified evaluation framework. To address this, we present the Environmental Large Language model Evaluation (ELLE) question answer (QA) dataset, the first benchmark designed to assess large language models and their applications in ecological and environmental sciences. The ELLE dataset includes 1,130 question answer pairs across 16 environmental topics, categorized by domain, difficulty, and type. This comprehensive dataset standardizes performance assessments in these fields, enabling consistent and objective comparisons of generative AI performance. By providing a dedicated evaluation tool, ELLE dataset promotes the development and application of generative AI technologies for sustainable environmental outcomes. The dataset and code are available at https://elle.ceeai.net/ and https://github.com/CEEAI/elle.
中文: ELLE数据集作为首个生态与环境科学领域大语言模型评估基准,通过标准化性能测评推动生成式AI技术在可持续发展中的应用。
English: The ELLE dataset introduces the first benchmark for evaluating large language models in ecological and environmental sciences, providing a standardized tool to enhance generative AI applications for sustainable outcomes.

Authors:Jiayu Guo, Yu Guo, Martha Li, Songtao Tan
Title: FLAME: Financial Large-Language Model Assessment and Metrics Evaluation
Abstract:
LLMs have revolutionized NLP and demonstrated potential across diverse domains. More and more financial LLMs have been introduced for finance-specific tasks, yet comprehensively assessing their value is still challenging. In this paper, we introduce FLAME, a comprehensive financial LLM evaluation system in Chinese, which includes two core evaluation benchmarks: FLAME-Cer and FLAME-Sce. FLAME-Cer covers 14 types of authoritative financial certifications, including CPA, CFA, and FRM, with a total of approximately 16,000 carefully selected questions. All questions have been manually reviewed to ensure accuracy and representativeness. FLAME-Sce consists of 10 primary core financial business scenarios, 21 secondary financial business scenarios, and a comprehensive evaluation set of nearly 100 tertiary financial application tasks. We evaluate 6 representative LLMs, including GPT-4o, GLM-4, ERNIE-4.0, Qwen2.5, XuanYuan3, and the latest Baichuan4-Finance, revealing that Baichuan4-Finance outperforms the other LLMs in most tasks. By establishing a comprehensive and professional evaluation system, FLAME facilitates the advancement of financial LLMs in Chinese contexts. Instructions for participating in the evaluation are available on GitHub: https://github.com/FLAME-ruc/FLAME.
中文:本文提出了FLAME,一个全面的中文金融大模型评估系统,包含金融认证与业务场景两大基准测试,评估显示百川4-金融在多数任务中表现最优。
English: This paper introduces FLAME, a comprehensive Chinese financial LLM evaluation system with two benchmarks—FLAME-Cer for financial certifications and FLAME-Sce for business scenarios—revealing Baichuan4-Finance's superior performance among tested models.

Authors:Soyeong Jeong, Kangsan Kim, Jinheon Baek, Sung Ju Hwang
Title: VideoRAG: Retrieval-Augmented Generation over Video Corpus
Abstract:
Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To tackle these issues, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. Also, since the context size of LVLMs may not be sufficient to process all frames in extremely long videos and not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showcasing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.
中文: VideoRAG是一种创新框架,通过动态检索相关视频并利用大型视频语言模型处理多模态内容来增强回答生成,它通过引入帧选择机制和文本提取策略弥补了现有方法的不足,从而提高了事实准确性。
English: VideoRAG is a novel framework that dynamically retrieves relevant videos and leverages their multimodal content through Large Video Language Models to enhance response generation, addressing gaps in existing methods by incorporating frame selection and text extraction for improved accuracy.

Authors:Antonin Poché, Alon Jacovi, Agustin Martin Picard, Victor Boutin, Fanny Jourdan
Title: ConSim: Measuring Concept-Based Explanations' Effectiveness with Automated Simulatability
Abstract:
Concept-based explanations work by mapping complex model computations to human-understandable concepts. Evaluating such explanations is very difficult, as it includes not only the quality of the induced space of possible concepts but also how effectively the chosen concepts are communicated to users. Existing evaluation metrics often focus solely on the former, neglecting the latter. We introduce an evaluation framework for measuring concept explanations via automated simulatability: a simulator's ability to predict the explained model's outputs based on the provided explanations. This approach accounts for both the concept space and its interpretation in an end-to-end evaluation. Human studies for simulatability are notoriously difficult to enact, particularly at the scale of a wide, comprehensive empirical evaluation (which is the subject of this work). We propose using large language models (LLMs) as simulators to approximate the evaluation and report various analyses to make such approximations reliable. Our method allows for scalable and consistent evaluation across various models and datasets. We report a comprehensive empirical evaluation using this framework and show that LLMs provide consistent rankings of explanation methods. Code available at https://github.com/AnonymousConSim/ConSim.
中文摘要:本文提出了一种利用大型语言模型的自动可模拟性评估框架,用于衡量基于概念的解释方法,实现了跨模型和数据集的可扩展评估,同时兼顾概念质量与传达效果。
English Summary: This paper introduces an automated simulatability framework using large language models to evaluate concept-based explanations, enabling scalable assessment of both concept quality and communication effectiveness across diverse models and datasets.

Authors:Matthew Baas, Pieter Scholtz, Arnav Mehta, Elliott Dyson, Akshat Prakash, Herman Kamper
Title: MARS6: A Small and Robust Hierarchical-Codec Text-to-Speech Model
Abstract:
Codec-based text-to-speech (TTS) models have shown impressive quality with zero-shot voice cloning abilities. However, they often struggle with more expressive references or complex text inputs. We present MARS6, a robust encoder-decoder transformer for rapid, expressive TTS. MARS6 is built on recent improvements in spoken language modelling. Utilizing a hierarchical setup for its decoder, new speech tokens are processed at a rate of only 12 Hz, enabling efficient modelling of long-form text while retaining reconstruction quality. We combine several recent training and inference techniques to reduce repetitive generation and improve output stability and quality. This enables the 70M-parameter MARS6 to achieve similar performance to models many times larger. We show this in objective and subjective evaluations, comparing TTS output quality and reference speaker cloning ability. Project page: https://camb-ai.github.io/mars6-turbo/

Authors:You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun
Title: Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
Abstract:
The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 24.94% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.
中文: 该研究提出了首个多图像精准定位模型Migician,并配套新数据集和基准测试,性能超越现有最佳模型24.94%。
English: The study introduces Migician, the first model for precise multi-image grounding, supported by a new dataset and benchmark, achieving a 24.94% performance improvement over existing models.

Authors:Sungjae Lee, Hyejin Park, Jaechang Kim, Jungseul Ok
Title: Semantic Exploration with Adaptive Gating for Efficient Problem Solving with Language Models
Abstract:
Recent advancements in large language models (LLMs) have shown remarkable potential in various complex tasks requiring multi-step reasoning methods like tree search to explore diverse reasoning paths. However, existing methods often suffer from computational inefficiency and redundancy. First, they overlook the diversity of task difficulties, leading to unnecessarily extensive searches even for easy tasks. Second, they neglect the semantics of reasoning paths, resulting in redundant exploration of semantically identical paths. To address these limitations, we propose Semantic Exploration with Adaptive Gating (SEAG), a computationally efficient method. SEAG employs an adaptive gating mechanism that dynamically decides whether to conduct a tree search, based on the confidence level of answers from a preceding simple reasoning method. Furthermore, its tree-based exploration consolidates semantically identical reasoning steps, reducing redundant explorations while maintaining or even improving accuracy. Our extensive experiments demonstrate that SEAG significantly improves accuracy by 4.3% on average while requiring only 31% of computational costs compared to existing tree search-based methods on complex reasoning benchmarks including GSM8K and ARC with diverse language models such as Llama2, Llama3, and Mistral. Our code is available at https://github.com/ml-postech/SEAG-semantic-exploration-with-adaptive-gating.
Chinese Summary: 提出的SEAG方法通过自适应门控机制根据任务难度动态调整搜索强度,并整合语义相同的推理路径,在仅需31%计算成本的情况下,相比现有树搜索方法平均准确率提升4.3%。
English Summary: The proposed Semantic Exploration with Adaptive Gating (SEAG) method enhances computational efficiency by dynamically adjusting search efforts based on task difficulty and consolidating semantically similar reasoning paths, achieving 4.3% higher accuracy with only 31% of computational costs compared to existing tree search methods.
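The gating logic is straightforward to sketch: run cheap self-consistency sampling first and escalate to tree search only when agreement is low. The two callables stand in for the paper's simple reasoner and tree-search procedure, and the threshold is illustrative.

```python
from collections import Counter

def seag_answer(prompt, sample_answer, tree_search, n=5, threshold=0.8):
    """Adaptive gating sketch: cheap self-consistency first, expensive
    tree search only when the model is not confident enough.

    sample_answer: fn(prompt) -> one sampled answer (simple reasoner)
    tree_search:   fn(prompt) -> answer from the full tree search
    """
    answers = [sample_answer(prompt) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    if count / n >= threshold:   # confident: skip the expensive search
        return best
    return tree_search(prompt)   # uncertain: fall back to tree search

# Toy usage with deterministic stand-ins.
print(seag_answer("2+2?", lambda p: "4", lambda p: "4 (searched)"))
```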

Authors:Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang
Title: ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Abstract:
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multihop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focuses. Specifically, ReFocus enables multimodal LLMs to generate Python codes to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment upon a wide range of structured image understanding tasks involving tables and charts. ReFocus largely improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and reasons why ReFocus can improve the performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and prove that such visual chain-of-thought with intermediate information offers a better supervision than standard VQA data, reaching a 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.

Authors:Xi Ye, Fangcong Yin, Yinghui He, Joie Zhang, Howard Yen, Tianyu Gao, Greg Durrett, Danqi Chen
Title: LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation
Abstract:
Existing benchmarks for evaluating long-context language models (LCLMs) primarily focus on long-context recall, requiring models to produce short responses based on a few critical snippets while processing thousands of irrelevant tokens. We introduce LongProc (Long Procedural Generation), a new benchmark that requires both the integration of highly dispersed information and long-form generation. LongProc consists of six diverse procedural generation tasks, such as extracting structured information from HTML pages into a TSV format and executing complex search procedures to create travel plans. These tasks challenge LCLMs by testing their ability to follow detailed procedural instructions, synthesize and reason over dispersed information, and generate structured, long-form outputs (up to 8K tokens). Furthermore, as these tasks adhere to deterministic procedures and yield structured outputs, they enable reliable rule-based evaluation. We evaluated 23 LCLMs, including instruction-tuned models and recent reasoning models, on LongProc at three difficulty levels, with the maximum number of output tokens set at 500, 2K, and 8K. Notably, while all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models like GPT-4o show significant degradation on 8K-token tasks. Reasoning models achieve stronger overall performance in long-form generation, benefiting from long CoT training. Further analysis reveals that LCLMs struggle to maintain long-range coherence in long-form generations. These findings highlight critical limitations in current LCLMs and suggest substantial room for improvement. Data and code available at: https://princeton-pli.github.io/LongProc.

Authors:Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, Zhicheng Dou
Title: Search-o1: Agentic Search-Enhanced Large Reasoning Models
Abstract:
Large reasoning models (LRMs) like OpenAI-o1 have demonstrated impressive long stepwise reasoning capabilities through large-scale reinforcement learning. However, their extended reasoning processes often suffer from knowledge insufficiency, leading to frequent uncertainties and potential errors. To address this limitation, we introduce Search-o1, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module for refining retrieved documents. Search-o1 integrates an agentic search workflow into the reasoning process, enabling dynamic retrieval of external knowledge when LRMs encounter uncertain knowledge points. Additionally, due to the verbose nature of retrieved documents, we design a separate Reason-in-Documents module to deeply analyze the retrieved information before injecting it into the reasoning chain, minimizing noise and preserving coherent reasoning flow. Extensive experiments on complex reasoning tasks in science, mathematics, and coding, as well as six open-domain QA benchmarks, demonstrate the strong performance of Search-o1. This approach enhances the trustworthiness and applicability of LRMs in complex reasoning tasks, paving the way for more reliable and versatile intelligent systems. The code is available at https://github.com/sunnynexus/Search-o1.
中文: Search-o1通过引入代理检索增强生成机制和文档内推理模块,动态获取并精炼外部知识,有效提升了大型推理模型在复杂任务中的性能和可信度。
English: Search-o1 enhances large reasoning models by integrating an agentic retrieval-augmented generation mechanism and a Reason-in-Documents module to dynamically retrieve and refine external knowledge, improving performance and reliability in complex reasoning tasks.
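One iteration of such an agentic loop might look like the sketch below, where all callables are placeholders: the model extends its chain, flags an uncertain knowledge point, retrieves, and a Reason-in-Documents-style step distills the verbose retrieval before it rejoins the chain.

```python
def search_o1_step(reason, is_uncertain, retrieve, distill, state):
    """One iteration of an agentic RAG reasoning loop (sketch only).

    reason:       fn(state) -> next reasoning step (string)
    is_uncertain: fn(step) -> True if the step hits a knowledge gap
    retrieve:     fn(step) -> list of retrieved documents
    distill:      fn(docs, state) -> compressed, relevant knowledge
                  (the Reason-in-Documents role)
    """
    step = reason(state)
    if is_uncertain(step):
        docs = retrieve(step)                       # agentic search
        step = step + "\n" + distill(docs, state)   # inject refined knowledge only
    return state + [step]

# Toy usage with placeholder callables.
state = search_o1_step(
    reason=lambda s: "Need the boiling point of cesium.",
    is_uncertain=lambda step: "Need" in step,
    retrieve=lambda step: ["Cesium boils at 671 °C ..."],
    distill=lambda docs, s: "Refined fact: cesium boils at 671 °C.",
    state=[],
)
print(state)
```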

Authors:Chengxing Xie, Bowen Li, Chang Gao, He Du, Wai Lam, Difan Zou, Kai Chen
Title: SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution
Abstract:
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks. One significant application of LLMs is in tackling software engineering challenges, particularly in resolving real-world tasks on GitHub by fixing code based on the issues reported by the users. However, many current approaches rely on proprietary LLMs, which limits reproducibility, accessibility, and transparency. The critical components of LLMs for addressing software engineering issues and how their capabilities can be effectively enhanced remain unclear. To address these challenges, we introduce SWE-Fixer, a novel open-source framework designed to effectively and efficiently resolve GitHub issues. SWE-Fixer comprises two essential modules: a code file retrieval module and a code editing module. The retrieval module employs BM25 along with a lightweight model to achieve coarse-to-fine file retrieval. Subsequently, the code editing module uses a second model to generate patches for the identified files. To mitigate the lack of publicly available datasets, we compile an extensive dataset that includes 110K GitHub issues along with their corresponding patches and train the two models of SWE-Fixer separately. We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models with scores of 22.0% and 30.2%. Furthermore, SWE-Fixer reaches state-of-the-art performance (24.7% on Lite and 32.8% on Verified) with PASS_TO_PASS (P2P) filtering. Additionally, our approach requires only two model calls per instance, making it significantly more efficient than existing methods. These results highlight the effectiveness of SWE-Fixer in real-world code-fixing scenarios. We will make our model, dataset, and code publicly available at https://github.com/InternLM/SWE-Fixer.
Chinese: SWE-Fixer是一个开源框架,通过文件检索和代码编辑双模块系统高效解决GitHub问题,在基准测试中取得优异性能且仅需少量模型调用。
English: SWE-Fixer is an open-source framework that efficiently resolves GitHub issues through a two-module system for file retrieval and code editing, achieving competitive performance on benchmarks while requiring minimal model calls.
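
The coarse stage of SWE-Fixer's retrieval module is plain BM25 over repository files. A toy sketch using the rank_bm25 package; the repository contents and query are invented, and the fine-grained model reranking and patch generation stages are omitted:

```python
from rank_bm25 import BM25Okapi

# Toy stand-in for a repository: path -> file contents.
repo_files = {
    "auth/login.py": "def login(user, password): ...",
    "db/session.py": "class Session: ...",
    "api/routes.py": "def register_routes(app): ...",
}

corpus = list(repo_files.values())
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

issue = "login fails with wrong password error"
scores = bm25.get_scores(issue.lower().split())

# Keep the top-k files for the finer, model-based stage.
top_k = sorted(zip(repo_files, scores), key=lambda p: -p[1])[:2]
print(top_k)
```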

Authors:Benjamin Reichman, Xiaofan Yu, Lanxiang Hu, Jack Truxal, Atishay Jain, Rushil Chandrupatla, Tajana Šimunić Rosing, Larry Heck
Title: SensorQA: A Question Answering Benchmark for Daily-Life Monitoring
Abstract:
With the rapid growth in sensor data, effectively interpreting and interfacing with these data in a human-understandable way has become crucial. While existing research primarily focuses on learning classification models, fewer studies have explored how end users can actively extract useful insights from sensor data, often hindered by the lack of a proper dataset. To address this gap, we introduce SensorQA, the first human-created question-answering (QA) dataset for long-term time-series sensor data for daily life monitoring. SensorQA is created by human workers and includes 5.6K diverse and practical queries that reflect genuine human interests, paired with accurate answers derived from sensor data. We further establish benchmarks for state-of-the-art AI models on this dataset and evaluate their performance on typical edge devices. Our results reveal a gap between current models and optimal QA performance and efficiency, highlighting the need for new contributions. The dataset and code are available at: https://github.com/benjamin-reichman/SensorQA.
Chinese: 本文提出首个针对长期传感器数据的人工构建问答数据集SensorQA,包含5.6千条查询以解决人类洞察提取的空白,并揭示了当前AI模型在性能与效率方面的不足。
English: This paper introduces SensorQA, the first human-curated question-answering dataset for long-term sensor data, featuring 5.6K queries to bridge the gap in extracting human-centric insights and revealing current AI models' limitations in performance and efficiency.

Authors:Wenqian Cui, Xiaoqi Jiao, Ziqiao Meng, Irwin King
Title: VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models
Abstract:
With the rising need for speech-based interaction models, end-to-end Spoken Language Models (SLMs) have emerged as a promising solution. While these models require comprehensive world knowledge for meaningful and reliable human interactions, existing question-answering (QA) benchmarks fall short in evaluating SLMs' knowledge understanding due to their inability to support end-to-end speech evaluation and account for varied input audio conditions. To address these limitations, we present VoxEval, a novel SpeechQA benchmark that assesses SLMs' knowledge understanding through pure speech interactions. Our benchmark 1) uniquely maintains speech format for both inputs and outputs, 2) evaluates model robustness across diverse input audio conditions, and 3) pioneers the assessment of complex tasks like mathematical reasoning in spoken format. Systematic evaluation demonstrates that VoxEval presents significant challenges to current SLMs, revealing their sensitivity to varying audio conditions and highlighting the need to enhance reasoning capabilities in future development. We hope this benchmark could guide the advancement of more sophisticated and reliable SLMs. VoxEval dataset is available at: https://github.com/dreamtheater123/VoxEval
Chinese: VoxEval是一种创新的语音问答基准,通过纯语音交互评估口语模型的知识理解能力,测试其在多样化音频条件和复杂推理任务中的鲁棒性。
English: VoxEval is a novel SpeechQA benchmark designed to evaluate spoken language models' knowledge understanding through pure speech interactions, testing their robustness across diverse audio conditions and complex reasoning tasks.

Authors:Qingyu Ren, Jie Zeng, Qianyu He, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Title: Step-by-Step Mastery: Enhancing Soft Constraint Following Ability of Large Language Models
Abstract:
It is crucial for large language models (LLMs) to follow instructions that involve multiple constraints. However, enhancing LLMs' ability to follow soft constraints remains largely unexplored. To bridge the gap, we first design a pipeline to construct datasets with high-quality outputs automatically. Additionally, to fully utilize the positive and negative samples generated during the data construction process, we choose Direct Preference Optimization (DPO) as the training method. Furthermore, taking into account the difficulty of soft constraints indicated by the number of constraints, we design a curriculum learning training paradigm based on the constraint quantity. We experimentally evaluate the effectiveness of our methods in improving LLMs' soft constraint following ability and analyze the factors driving the improvements. The datasets and code are publicly available at https://github.com/Rainier-rq/FollowSoftConstraint.
Chinese: 本研究提出了一种自动构建高质量数据集的新流程,并采用直接偏好优化与课程学习相结合的方法,有效提升了大语言模型遵循软约束的能力,实验验证了其显著改进效果。
English: This study introduces a novel pipeline for automatically generating datasets and employs Direct Preference Optimization with curriculum learning to enhance large language models' ability to follow soft constraints, demonstrating significant improvements through experimental validation.
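
For readers unfamiliar with the two training ingredients named above, here is a minimal sketch: the standard DPO objective on paired positive/negative samples, plus curriculum ordering by constraint count. The toy prompts and log-probabilities are placeholders, not the paper's data.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on summed token log-probs of each response."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Curriculum learning: train on instructions with fewer soft constraints first.
samples = [
    {"prompt": "Write a poem.", "n_constraints": 1},
    {"prompt": "Write a poem; 3 stanzas; mention rain; end hopefully.", "n_constraints": 3},
    {"prompt": "Write a poem; 2 stanzas; mention rain.", "n_constraints": 2},
]
curriculum = sorted(samples, key=lambda s: s["n_constraints"])

# One toy loss evaluation on scalar log-probabilities.
loss = dpo_loss(torch.tensor(-10.0), torch.tensor(-12.0),
                torch.tensor(-11.0), torch.tensor(-11.5))
print(curriculum[0]["prompt"], float(loss))
```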

Authors:Long Mai, Julie Carson-Berndsen
Title: Real-Time Textless Dialogue Generation
Abstract:
Recent advancements in large language models (LLMs) have led to significant progress in text-based dialogue systems. These systems can now generate high-quality responses that are accurate and coherent across a wide range of topics and tasks. However, spoken dialogue systems still lag behind in terms of naturalness. They tend to produce robotic interactions, with issues such as slow response times, overly generic or cautious replies, and a lack of natural rhythm and fluid turn-taking. This shortcoming is largely due to the over-reliance on the traditional cascaded design, which involves separate, sequential components, as well as the use of text as an intermediate representation. This paper proposes a real-time, textless spoken dialogue generation model (RTTL-DG) that aims to overcome these challenges. Our system enables fluid turn-taking and generates responses with minimal delay by processing streaming spoken conversation directly. Additionally, our model incorporates backchannels, fillers, laughter, and other paralinguistic signals, which are often absent in cascaded dialogue systems, to create more natural and human-like interactions. The implementations and generated samples are available in our repository: https://github.com/mailong25/rts2s-dg
Chinese: 尽管大语言模型的进步提升了文本对话系统的表现,但语音对话系统因依赖级联设计和文本中间表示而缺乏自然性,为此本文提出一种实时无文本生成模型,通过直接处理语音流并融入副语言特征来实现更流畅自然的交互。
English: Recent advancements in large language models have improved text-based dialogue systems, but spoken dialogue systems still lack naturalness due to cascaded designs and text reliance, prompting the development of a real-time textless model that enables fluid interactions with paralinguistic signals.

Authors:Yaoxiang Wang, Haoling Li, Xin Zhang, Jie Wu, Xiao Liu, Wenxiang Hu, Zhongxin Guo, Yangyu Huang, Ying Xin, Yujiu Yang, Jinsong Su, Qi Chen, Scarlett Li
Title: EpiCoder: Encompassing Diversity and Complexity in Code Generation
Abstract:
Existing methods for code generation use code snippets as seed data, restricting the complexity and diversity of the synthesized data. In this paper, we introduce a novel feature tree-based synthesis framework, which revolves around hierarchical code features derived from high-level abstractions of code. The feature tree is constructed from raw data and refined iteratively to increase the quantity and diversity of the extracted features, allowing the framework to capture and recognize more complex patterns and relationships within the code. By adjusting the depth and breadth of the sampled subtrees, our framework provides precise control over the complexity of the generated code, enabling functionalities that range from function-level operations to multi-file scenarios. We fine-tuned widely-used base models to obtain the EpiCoder series, achieving state-of-the-art performance on multiple benchmarks at both the function and file levels. In particular, empirical evidence indicates that our approach shows significant potential in synthesizing repository-level code data. Our code and data are publicly available at https://github.com/microsoft/EpiCoder.
Chinese: 本文提出了一种基于特征树的合成框架,通过从代码高级抽象中迭代精炼层次特征来增强代码生成,实现了对复杂度的精确控制,并在多个基准测试中达到了最先进的性能。
English: This paper introduces a feature tree-based synthesis framework that enhances code generation by iteratively refining hierarchical features from high-level abstractions, enabling precise control over complexity and achieving state-of-the-art performance across multiple benchmarks.
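
The depth/breadth control over sampled subtrees can be illustrated in a few lines; the feature tree below is invented for illustration, and the mapping from a sampled subtree to a concrete code task is left abstract:

```python
import random

# Invented feature tree: nested dicts of hierarchical code features.
feature_tree = {
    "file_io": {"read": {}, "write": {"atomic_write": {}}},
    "networking": {"http_client": {"retries": {}}, "sockets": {}},
    "concurrency": {"threads": {}, "async": {"task_groups": {}}},
}

def sample_subtree(tree, max_depth, max_breadth, rng=random):
    """Randomly keep at most max_breadth children per node, up to max_depth."""
    if max_depth == 0 or not tree:
        return {}
    keys = rng.sample(list(tree), k=min(max_breadth, len(tree)))
    return {k: sample_subtree(tree[k], max_depth - 1, max_breadth, rng) for k in keys}

# Shallow, narrow subtree -> simple function-level task;
# deeper, wider subtree -> multi-file scenario.
print(sample_subtree(feature_tree, max_depth=1, max_breadth=1))
print(sample_subtree(feature_tree, max_depth=3, max_breadth=2))
```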

Authors:Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Ruihang Chu, Jin Zeng, Yujiu Yang
Title: URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Abstract:
Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which imposes further limitations on the upper bounds of TTS and reinforcement learning (RL); (2) a lack of automated methods for process labeling within multimodal contexts persists; (3) the employment of process rewards in unimodal RL faces issues like reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we go through an automatic process to synthesize process supervision data, which emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), pioneering a multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks. Code, data and checkpoint can be found at https://github.com/URSA-MATH.
Chinese: 本研究提出URSA三阶段框架,通过构建高质量多模态推理数据集、开发自动化过程监督方法及创新强化学习算法,显著提升多模态数学推理能力,在多个基准测试中超越主流模型表现。
English: This study introduces URSA, a three-stage framework that enhances multimodal mathematical reasoning by developing a high-quality dataset, creating automated process supervision, and implementing a novel reinforcement learning method, achieving superior performance over leading models.

Authors:Tarek Naous, Wei Xu
Title: On The Origin of Cultural Biases in Language Models: From Pre-training Data to Linguistic Phenomena
Abstract:
Language Models (LMs) have been shown to exhibit a strong preference towards entities associated with Western culture when operating in non-Western languages. In this paper, we aim to uncover the origins of entity-related cultural biases in LMs by analyzing several contributing factors, including the representation of entities in pre-training data and the impact of variations in linguistic phenomena across languages. We introduce CAMeL-2, a parallel Arabic-English benchmark of 58,086 entities associated with Arab and Western cultures and 367 masked natural contexts for entities. Our evaluations using CAMeL-2 reveal reduced performance gaps between cultures by LMs when tested in English compared to Arabic. We find that LMs struggle in Arabic with entities that appear at high frequencies in pre-training, where entities can hold multiple word senses. This also extends to entities that exhibit high lexical overlap with languages that are not Arabic but use the Arabic script. Further, we show how frequency-based tokenization leads to this issue in LMs, which gets worse with larger Arabic vocabularies. We will make CAMeL-2 available at: https://github.com/tareknaous/camel2
Chinese: 语言模型在非西方语言中表现出偏向西方实体的文化偏见,CAMeL-2基准测试显示,由于基于频率的分词和词汇歧义,模型在阿拉伯语中的表现差距更为明显。
English: Language models exhibit cultural biases favoring Western entities in non-Western languages, with CAMeL-2 benchmark revealing performance gaps in Arabic due to frequency-based tokenization and lexical ambiguities.

Authors:Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu
Title: InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
Abstract:
Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce \textit{InfiGUIAgent}, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. \textit{InfiGUIAgent} achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at \url{https://github.com/Reallm-Labs/InfiGUIAgent}.
Chinese: InfiGUIAgent是一种基于多模态大语言模型的图形界面代理,通过两阶段微调训练具备原生推理能力,在多个基准测试中表现出色,提升了自动化任务的交互效果。
English: InfiGUIAgent, an MLLM-based GUI agent trained with a two-stage fine-tuning pipeline, enhances GUI interaction through native reasoning skills and achieves competitive performance on benchmarks.

Authors:Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang
Title: rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Abstract:
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
Chinese: rStar-Math 通过蒙特卡洛树搜索实现"深度思考",结合创新的数据合成、过程奖励模型和自进化方法,使小型语言模型在数学推理上超越 OpenAI o1,在 MATH 和 AIME 基准测试中达到顶尖水平。
English: rStar-Math demonstrates that small language models can outperform OpenAI o1 in math reasoning by employing Monte Carlo Tree Search for deep thinking, enhanced through innovations in data synthesis, process reward modeling, and iterative self-evolution, achieving state-of-the-art results on benchmarks like MATH and AIME.
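
The "deep thinking" search is standard MCTS; a compact UCT selection rule, its core decision step, looks roughly like the sketch below. The node structure is a placeholder assumption, and the PPM reward signal is abstracted into node values:

```python
import math

class Node:
    def __init__(self, step, parent=None):
        self.step, self.parent = step, parent
        self.children, self.visits, self.value = [], 0, 0.0

def uct_select(node, c=1.4):
    """Pick the child balancing mean value (exploitation) vs. visit count (exploration)."""
    return max(node.children,
               key=lambda ch: (ch.value / (ch.visits + 1e-8)
                               + c * math.sqrt(math.log(node.visits + 1) / (ch.visits + 1e-8))))

root = Node("problem")
root.visits = 10
for step, v, n in [("try factoring", 3.0, 4), ("use substitution", 2.0, 2)]:
    ch = Node(step, root); ch.value, ch.visits = v, n
    root.children.append(ch)
print(uct_select(root).step)  # less-visited branch wins here: "use substitution"
```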

Authors:Paweł Batorski, Jannik Brinkmann, Paul Swoboda
Title: NSA: Neuro-symbolic ARC Challenge
Abstract:
The Abstraction and Reasoning Corpus (ARC) evaluates general reasoning capabilities that are difficult for both machine learning models and combinatorial search methods. We propose a neuro-symbolic approach that combines a transformer for proposal generation with combinatorial search using a domain-specific language. The transformer narrows the search space by proposing promising search directions, which allows the combinatorial search to find the actual solution in a short time. We pre-train the transformer with synthetically generated data. At test time, we generate additional task-specific training tasks and fine-tune our model. Our results surpass the comparable state of the art on the ARC evaluation set by 27% and compare favourably on the ARC train set. We make our code and dataset publicly available at https://github.com/Batorskq/NSA.
Chinese: 本文提出了一种神经符号方法,结合Transformer生成搜索建议和组合搜索,在抽象与推理语料库任务中取得显著成效,性能超越现有最佳方法27%。
English: This paper introduces a neuro-symbolic method that uses a transformer to propose search directions and combinatorial search to efficiently solve tasks in the Abstraction and Reasoning Corpus, achieving a 27% improvement over state-of-the-art results.

Authors:Qiang Sun, Sirui Li, Du Huynh, Mark Reynolds, Wei Liu
Title: TimelineKGQA: A Comprehensive Question-Answer Pair Generator for Temporal Knowledge Graphs
Abstract:
Question answering over temporal knowledge graphs (TKGs) is crucial for understanding evolving facts and relationships, yet its development is hindered by limited datasets and difficulties in generating custom QA pairs. We propose a novel categorization framework based on timeline-context relationships, along with \textbf{TimelineKGQA}, a universal temporal QA generator applicable to any TKGs. The code is available at: \url{https://github.com/PascalSun/TimelineKGQA} as an open source Python package.
Chinese: 本研究提出了一种新的分类框架和TimelineKGQA通用生成器,用于时序知识图谱问答,旨在解决数据集限制并改进自定义问答对的生成。
English: This study introduces a new categorization framework and TimelineKGQA, a universal generator for temporal question answering over knowledge graphs, addressing dataset limitations and enhancing custom QA pair creation.

Authors:Dong-Hai Zhu, Yu-Jie Xiong, Jia-Chen Zhang, Xi-Jiong Xie, Chun-Ming Xia
Title: Understanding Before Reasoning: Enhancing Chain-of-Thought with Iterative Summarization Pre-Prompting
Abstract:
Chain-of-Thought (CoT) Prompting is a dominant paradigm in Large Language Models (LLMs) to enhance complex reasoning. It guides LLMs to present multi-step reasoning, rather than generating the final answer directly. However, CoT encounters difficulties when key information required for reasoning is implicit or missing. This occurs because CoT emphasizes the sequence of reasoning steps while overlooking the early extraction of essential information. We propose a pre-prompting method called Iterative Summarization Pre-Prompting (ISP^2) to refine LLM reasoning when key information is not explicitly provided. First, entities and their corresponding descriptions are extracted to form potential key information pairs. Next, we use a reliability rating to assess these pairs, then merge the two lowest-ranked pairs into a new entity description. This process is repeated until a unique key information pair is obtained. Finally, that pair, along with the original question, is fed into LLMs to produce the answer. Extensive experiments demonstrate a 7.1% improvement compared to existing methods. Unlike traditional prompting, ISP^2 adopts an inductive approach with pre-prompting, offering flexible integration into diverse reasoning frameworks. The code is available at https://github.com/zdhgreat/ISP-2.
Chinese: 针对思维链提示在关键信息缺失或隐含时推理困难的问题,提出的迭代摘要预提示方法通过逐步提取和精炼关键信息对来优化大语言模型的推理能力,相比现有方法性能提升了7.1%。
English: Chain-of-Thought prompting struggles with implicit or missing key information in reasoning, so the proposed Iterative Summarization Pre-Prompting (ISP²) method enhances LLM performance by iteratively extracting and refining essential information pairs before generating answers, achieving a 7.1% improvement over existing methods.
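
The merge loop at the heart of ISP² is easy to state in code. A sketch, with the LLM-based reliability rating and merging replaced by toy stand-ins:

```python
def iterative_summarization(pairs, rate, merge):
    """pairs: list of (entity, description) tuples;
    rate(pair) -> reliability score; merge(a, b) -> one combined pair."""
    pairs = list(pairs)
    while len(pairs) > 1:
        pairs.sort(key=rate)               # lowest-rated first
        a, b = pairs.pop(0), pairs.pop(0)  # take the two weakest pairs
        pairs.append(merge(a, b))          # fold them into one richer pair
    return pairs[0]                        # unique key information pair

# Toy stand-ins: rate by description length, merge by concatenation.
pairs = [("Alice", "a doctor"), ("Bob", "Alice's patient since 2020"),
         ("clinic", "where Alice works")]
key = iterative_summarization(pairs,
                              rate=lambda p: len(p[1]),
                              merge=lambda a, b: (a[0], a[1] + "; " + b[1]))
print(key)
```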

Authors:Ziming Luo, Zonglin Yang, Zexin Xu, Wei Yang, Xinya Du
Title: LLM4SR: A Survey on Large Language Models for Scientific Research
Abstract:
In recent years, the rapid advancement of Large Language Models (LLMs) has transformed the landscape of scientific research, offering unprecedented support across various stages of the research cycle. This paper presents the first systematic survey dedicated to exploring how LLMs are revolutionizing the scientific research process. We analyze the unique roles LLMs play across four critical stages of research: hypothesis discovery, experiment planning and implementation, scientific writing, and peer reviewing. Our review comprehensively showcases the task-specific methodologies and evaluation benchmarks. By identifying current challenges and proposing future research directions, this survey not only highlights the transformative potential of LLMs, but also aims to inspire and guide researchers and practitioners in leveraging LLMs to advance scientific inquiry. Resources are available at the following repository: https://github.com/du-nlp-lab/LLM4SR
Chinese: 本文首次系统综述了大语言模型如何通过推动假设发现、实验规划、科学写作和同行评审等关键环节来变革科研流程,同时指出了当前挑战与未来研究方向。
English: This paper provides the first systematic survey on how Large Language Models are revolutionizing scientific research by analyzing their roles in hypothesis discovery, experiment planning, scientific writing, and peer review, while identifying challenges and future directions.

Authors:Siddharth Joshi, Besmira Nushi, Vidhisha Balachandran, Varun Chandrasekaran, Vibhav Vineet, Neel Joshi, Baharan Mirzasoleiman
Title: MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
Abstract:
Vision-language models (VLMs) are highly effective but often underperform on specialized tasks; for example, Llava-1.5 struggles with chart and diagram understanding due to scarce task-specific training data. Existing training data, sourced from general-purpose datasets, fails to capture the nuanced details needed for these tasks. We introduce MM-Gen, a scalable method that generates task-specific, high-quality synthetic text for candidate images by leveraging stronger models. MM-Gen employs a three-stage targeted process: partitioning data into subgroups, generating targeted text based on task descriptions, and filtering out redundant and outlier data. Fine-tuning VLMs with data generated by MM-Gen leads to significant performance gains, including 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5 (7B). Compared to human-curated caption data, MM-Gen achieves up to 1.6x better improvements for the original models, proving its effectiveness in enhancing task-specific VLM performance and bridging the gap between general-purpose datasets and specialized requirements. Code available at https://github.com/sjoshi804/MM-Gen.
Chinese: MM-Gen是一种可扩展的方法,通过为图像生成高质量的合成文本来提升视觉语言模型在专业任务上的性能,例如使Llava-1.5在空间推理和图表理解方面分别实现了29%和15%的显著提升。
English: MM-Gen is a scalable method that generates high-quality synthetic text for images to enhance vision-language models' performance on specialized tasks, achieving significant improvements such as 29% on spatial reasoning and 15% on diagram understanding for Llava-1.5.

Authors:Xiaoqing Zhang, Ang Lv, Yuhan Liu, Flood Sung, Wei Liu, Jian Luan, Shuo Shang, Xiuying Chen, Rui Yan
Title: More is not always better? Enhancing Many-Shot In-Context Learning with Differentiated and Reweighting Objectives
Abstract:
Large language models (LLMs) excel at few-shot in-context learning (ICL) without requiring parameter updates. However, as ICL demonstrations increase from a few to many, performance tends to plateau and eventually decline. We identify two primary causes for this trend: the suboptimal negative log-likelihood (NLL) optimization objective and the incremental data noise. To address these issues, we introduce \textit{DrICL}, a novel optimization method that enhances model performance through \textit{Differentiated} and \textit{Reweighting} objectives. Globally, DrICL utilizes differentiated learning to optimize the NLL objective, ensuring that many-shot performance surpasses zero-shot levels. Locally, it dynamically adjusts the weighting of many-shot demonstrations by leveraging cumulative advantages inspired by reinforcement learning, thereby mitigating the impact of noisy data. Recognizing the lack of multi-task datasets with diverse many-shot distributions, we develop the \textit{Many-Shot ICL Benchmark} (ICL-50), a large-scale benchmark of 50 tasks that cover shot numbers from 1 to 350 within sequences of up to 8,000 tokens, for both fine-tuning and evaluation purposes. Experimental results demonstrate that LLMs enhanced with DrICL achieve significant improvements in many-shot setups across various tasks, including both in-domain and out-of-domain scenarios. We release the code and dataset hoping to facilitate further research in many-shot ICL\footnote{https://github.com/xiaoqzhwhu/DrICL}.
Chinese: 大语言模型在多样本上下文学习中因优化目标不理想和数据噪声导致性能下降,而提出的DrICL方法通过差异化学习和动态加权机制有效解决这些问题,在新构建的基准测试中显著提升了多样本场景下的任务表现。
English: Large language models face performance degradation in many-shot in-context learning due to suboptimal optimization objectives and data noise, which the proposed DrICL method addresses through differentiated learning and dynamic demonstration reweighting, achieving significant improvements across tasks as validated on a newly developed benchmark.

Authors:Hao Zheng, Xinyan Guan, Hao Kong, Jia Zheng, Weixiang Zhou, Hongyu Lin, Yaojie Lu, Ben He, Xianpei Han, Le Sun
Title: PPTAgent: Generating and Evaluating Presentations Beyond Text-to-Slides
Abstract:
Automatically generating presentations from documents is a challenging task that requires accommodating content quality, visual appeal, and structural coherence. Existing methods primarily focus on improving and evaluating the content quality in isolation, overlooking visual appeal and structural coherence, which limits their practical applicability. To address these limitations, we propose PPTAgent, which comprehensively improves presentation generation through a two-stage, edit-based approach inspired by human workflows. PPTAgent first analyzes reference presentations to extract slide-level functional types and content schemas, then drafts an outline and iteratively generates editing actions based on selected reference slides to create new slides. To comprehensively evaluate the quality of generated presentations, we further introduce PPTEval, an evaluation framework that assesses presentations across three dimensions: Content, Design, and Coherence. Results demonstrate that PPTAgent significantly outperforms existing automatic presentation generation methods across all three dimensions.
Chinese: PPTAgent采用基于编辑的两阶段方法,通过分析参考幻灯片并迭代生成新幻灯片,在内容、设计和连贯性三个维度上均显著优于现有自动演示文稿生成方法,并通过PPTEval框架进行全面评估。
English: PPTAgent is a two-stage, edit-based approach that enhances presentation generation by analyzing reference slides and iteratively creating new ones, significantly outperforming existing methods in content, design, and coherence as evaluated by the PPTEval framework.

Authors:Jiakang Yuan, Xiangchao Yan, Shiyang Feng, Bo Zhang, Tao Chen, Botian Shi, Wanli Ouyang, Yu Qiao, Lei Bai, Bowen Zhou
Title: Dolphin: Moving Towards Closed-loop Auto-research through Thinking, Practice, and Feedback
Abstract:
The scientific research paradigm is undergoing a profound transformation owing to the development of Artificial Intelligence (AI). Recent works demonstrate that various AI-assisted research methods can largely improve research efficiency by improving data analysis, accelerating computation, and fostering novel idea generation. To further move towards the ultimate goal (i.e., automatic scientific research), in this paper, we introduce Dolphin, a closed-loop LLM-driven framework to enhance the automation level of scientific research. Dolphin first generates novel ideas based on feedback from previous experiments and relevant papers ranked by the topic and task attributes. Then, the generated ideas can be implemented using a code template refined and debugged with the designed exception-traceback-guided local code structure. Finally, Dolphin automatically analyzes the results of each idea and feeds the results back to the next round of idea generation. Experiments are conducted on the benchmark datasets of different topics and a subset of MLE-bench. Results show that Dolphin can continuously improve the performance of the input topic in a loop. We highlight that Dolphin can automatically propose methods that are comparable to the state-of-the-art in some tasks such as 3D point classification.
Chinese: Dolphin框架提出了一种闭环、大语言模型驱动的系统,通过生成想法、实施代码和分析结果来自动化科学研究,在3D点分类等任务中展现出与最先进方法相媲美的性能提升。
English: The Dolphin framework introduces a closed-loop, LLM-driven system that automates scientific research by generating ideas, implementing code, and analyzing results, demonstrating performance improvements comparable to state-of-the-art methods in tasks like 3D point classification.

Authors:Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng
Title: LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Abstract:
The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and textual instructions into the context of large language models (LLMs), where large-scale parameters and numerous context tokens (predominantly vision tokens) result in substantial computational overhead. Previous efforts towards efficient LMMs always focus on replacing the LLM backbone with smaller models, while neglecting the crucial issue of token quantity. In this paper, we introduce LLaVA-Mini, an efficient LMM with minimal vision tokens. To achieve a high compression ratio of vision tokens while preserving visual information, we first analyze how LMMs understand vision tokens and find that most vision tokens only play a crucial role in the early layers of LLM backbone, where they mainly fuse visual information into text tokens. Building on this finding, LLaVA-Mini introduces modality pre-fusion to fuse visual information into text tokens in advance, thereby facilitating the extreme compression of vision tokens fed to LLM backbone into one token. LLaVA-Mini is a unified large multimodal model that can support the understanding of images, high-resolution images, and videos in an efficient manner. Experiments across 11 image-based and 7 video-based benchmarks demonstrate that LLaVA-Mini outperforms LLaVA-v1.5 with just 1 vision token instead of 576. Efficiency analyses reveal that LLaVA-Mini can reduce FLOPs by 77%, deliver low-latency responses within 40 milliseconds, and process over 10,000 frames of video on the GPU hardware with 24GB of memory.
Chinese: LLaVA-Mini是一种高效的大型多模态模型,通过模态预融合将视觉令牌压缩至仅一个,在显著降低计算开销的同时,保持了在图像和视频基准测试中的优异性能。
English: LLaVA-Mini is an efficient large multimodal model that drastically reduces computational overhead by compressing vision tokens to just one through modality pre-fusion, while maintaining high performance across image and video benchmarks.
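
A rough sketch of the two ideas, modality pre-fusion and extreme vision-token compression, using generic cross-attention blocks. Dimensions and module layout are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class PreFusionCompressor(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.prefuse = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # single vision token
        self.compress = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_tokens, vision_tokens):
        # Text tokens absorb visual information before entering the LLM backbone.
        fused_text, _ = self.prefuse(text_tokens, vision_tokens, vision_tokens)
        # Hundreds of vision tokens collapse into one summary token via a learned query.
        q = self.query.expand(vision_tokens.size(0), -1, -1)
        one_token, _ = self.compress(q, vision_tokens, vision_tokens)
        return torch.cat([one_token, text_tokens + fused_text], dim=1)

x_text = torch.randn(2, 32, 512)
x_vis = torch.randn(2, 576, 512)   # 576 mirrors LLaVA-v1.5's vision token count
print(PreFusionCompressor()(x_text, x_vis).shape)  # (2, 33, 512)
```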

Authors:Yindu Su, Huike Zou, Lin Sun, Ting Zhang, Haiyang Yang, Liyu Chen, David Lo, Qingheng Zhang, Shuguang Han, Jufeng Chen
Title: TACLR: A Scalable and Efficient Retrieval-based Method for Industrial Product Attribute Value Identification
Abstract:
Product Attribute Value Identification (PAVI) involves identifying attribute values from product profiles, a key task for improving product search, recommendation, and business analytics on e-commerce platforms. However, existing PAVI methods face critical challenges, such as inferring implicit values, handling out-of-distribution (OOD) values, and producing normalized outputs. To address these limitations, we introduce Taxonomy-Aware Contrastive Learning Retrieval (TACLR), the first retrieval-based method for PAVI. TACLR formulates PAVI as an information retrieval task by encoding product profiles and candidate values into embeddings and retrieving values based on their similarity. It leverages contrastive training with taxonomy-aware hard negative sampling and employs adaptive inference with dynamic thresholds. TACLR offers three key advantages: (1) it effectively handles implicit and OOD values while producing normalized outputs; (2) it scales to thousands of categories, tens of thousands of attributes, and millions of values; and (3) it supports efficient inference for high-load industrial deployment. Extensive experiments on proprietary and public datasets validate the effectiveness and efficiency of TACLR. Further, it has been successfully deployed on the real-world e-commerce platform Xianyu, processing millions of product listings daily with frequently updated, large-scale attribute taxonomies. We release the code to facilitate reproducibility and future research at https://github.com/SuYindu/TACLR.
Chinese: TACLR是一种基于检索的产品属性值识别新方法,通过对比学习和自适应推理有效解决了隐式值和分布外值等难题,实现了在电商平台上的可扩展高效部署。
English: TACLR is a novel retrieval-based method for product attribute value identification that overcomes challenges like implicit and out-of-distribution values through contrastive learning and adaptive inference, enabling scalable and efficient deployment on e-commerce platforms.
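
Framing PAVI as retrieval reduces inference to a nearest-candidate lookup with an abstention threshold. A sketch with a random placeholder encoder; the contrastive training and taxonomy-aware hard negatives are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
embed = lambda texts: rng.standard_normal((len(texts), 64))  # placeholder encoder

def identify_value(profile, candidates, threshold=0.2):
    p = embed([profile])[0]
    c = embed(candidates)
    sims = c @ p / (np.linalg.norm(c, axis=1) * np.linalg.norm(p) + 1e-8)
    best = int(np.argmax(sims))
    # Dynamic threshold: abstain on implicit or out-of-distribution values.
    return candidates[best] if sims[best] >= threshold else None

values = ["cotton", "polyester", "leather"]
# Returns a normalized candidate value, or None to abstain.
print(identify_value("slim-fit cotton shirt, machine washable", values))
```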

Authors:Jiayao Gu, Liting Chen, Yihong Li
Title: Investigating the Impact of Data Selection Strategies on Language Model Performance
Abstract:
Data selection is critical for enhancing the performance of language models, particularly when aligning training datasets with a desired target distribution. This study explores the effects of different data selection methods and feature types on model performance. We evaluate whether selecting data subsets can influence downstream tasks, whether n-gram features improve alignment with target distributions, and whether embedding-based neural features provide complementary benefits. Through comparative experiments using baseline random selection methods and distribution-aligned approaches, we provide insights into the interplay between data selection strategies and model training efficacy. All code for this study can be found in the \href{https://github.com/jgu13/HIR-Hybrid-Importance-Resampling-for-Language-Models}{GitHub repository}.
Chinese: 本研究通过对比实验探讨不同数据选择方法和特征类型如何影响语言模型性能,分析其与目标分布的匹配程度及对下游任务的作用。
English: This study investigates how various data selection methods and feature types impact language model performance, examining their alignment with target distributions and effects on downstream tasks through comparative experiments.

Authors:Avishai Elmakies, Omri Abend, Yossi Adi
Title: Unsupervised Speech Segmentation: A General Approach Using Speech Language Models
Abstract:
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at https://github.com/avishaiElmakies/unsupervised_speech_segmentation_using_slm.
Chinese: 本文提出了一种无监督语音分割方法,利用语音语言模型处理多种声学语义风格变化,在边界检测和分段纯度方面优于基线方法。
English: This paper presents an unsupervised speech segmentation method that leverages Speech Language Models to handle multiple acoustic-semantic style changes, outperforming baselines in boundary detection and segment purity.

Authors:Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina Danilevsky
Title: MTRAG: A Multi-Turn Conversational Benchmark for Evaluating Retrieval-Augmented Generation Systems
Abstract:
Retrieval-augmented generation (RAG) has recently become a very popular task for Large Language Models (LLMs). Evaluating them on multi-turn RAG conversations, where the system is asked to generate a response to a question in the context of a preceding conversation, is an important and often overlooked task with several additional challenges. We present MTRAG: an end-to-end human-generated multi-turn RAG benchmark that reflects several real-world properties across diverse dimensions for evaluating the full RAG pipeline. MTRAG contains 110 conversations averaging 7.7 turns each across four domains for a total of 842 tasks. We also explore automation paths via synthetic data and LLM-as-a-Judge evaluation. Our human and automatic evaluations show that even state-of-the-art LLM RAG systems struggle on MTRAG. We demonstrate the need for strong retrieval and generation systems that can handle later turns, unanswerable questions, non-standalone questions, and multiple domains. MTRAG is available at https://github.com/ibm/mt-rag-benchmark.
Chinese: MTRAG是一个人工构建的多轮RAG评测基准,揭示了当前最先进的大语言模型在处理跨领域复杂对话场景时面临的挑战。
English: MTRAG is a human-generated multi-turn RAG benchmark that reveals the challenges state-of-the-art LLMs face in handling complex conversational contexts across diverse domains.

Authors:Pengwei Tang, Xiaolin Hu, Yong Liu
Title: ADePT: Adaptive Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning
Abstract:
Prompt Tuning (PT) enables the adaptation of Pre-trained Large Language Models (PLMs) to downstream tasks by optimizing a small number of soft virtual tokens, which are prepended to the input token embeddings. Recently, Decomposed Prompt Tuning (DePT) has demonstrated superior adaptation capabilities by decomposing the soft prompt into a shorter soft prompt and a pair of low-rank matrices. The product of the pair of low-rank matrices is added to the input token embeddings to offset them. Additionally, DePT achieves faster inference compared to PT due to the shorter soft prompt. However, in this paper, we find that the position-based token embedding offsets of DePT restrict its ability to generalize across diverse model inputs, and that the shared embedding offsets across many token embeddings result in sub-optimal learning. To tackle these issues, we introduce Adaptive Decomposed Prompt Tuning (ADePT), which is composed of a short soft prompt and a shallow token-shared feed-forward neural network. ADePT utilizes the token-shared feed-forward neural network to learn the embedding offsets for each token, enabling adaptive embedding offsets that vary according to the model input and better optimization of token embedding offsets. This enables ADePT to achieve superior adaptation performance without requiring more inference time or additional trainable parameters compared to vanilla PT and its variants. In comprehensive experiments across 23 natural language processing tasks and 4 typical PLMs of different scales, ADePT consistently surpasses the other leading parameter-efficient fine-tuning methods, and even outperforms full fine-tuning in certain scenarios. We also provide a theoretical analysis of ADePT. Code is available at https://github.com/HungerPWAY/ADePT.
Chinese: 自适应分解提示调优(ADePT)通过采用令牌共享前馈神经网络生成自适应嵌入偏移,在多种自然语言处理任务中实现了卓越性能,且相比现有方法无需增加推理时间或可训练参数。
English: Adaptive Decomposed Prompt Tuning (ADePT) enhances prompt tuning by using a token-shared feed-forward network to generate adaptive embedding offsets, achieving superior performance across diverse NLP tasks without increasing inference time or parameters compared to existing methods.
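
A minimal sketch of the ADePT parameterization as described: a short soft prompt plus a shallow token-shared feed-forward network whose output offsets each token's embedding. Sizes are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ADePTEmbedding(nn.Module):
    def __init__(self, d_model=768, prompt_len=20, hidden=64):
        super().__init__()
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
        self.offset_net = nn.Sequential(          # shared across all tokens
            nn.Linear(d_model, hidden), nn.ReLU(), nn.Linear(hidden, d_model))

    def forward(self, token_embeds):  # (batch, seq, d_model)
        # Offsets depend on each token's own embedding, so they adapt to the
        # input, unlike DePT's fixed position-based offsets.
        adapted = token_embeds + self.offset_net(token_embeds)
        prompt = self.soft_prompt.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([prompt, adapted], dim=1)

print(ADePTEmbedding()(torch.randn(4, 16, 768)).shape)  # (4, 36, 768)
```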

Authors:Peihai Jiang, Xixiang Lyu, Yige Li, Jing Ma
Title: Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models
Abstract:
Supervised fine-tuning has become the predominant method for adapting large pretrained models to downstream tasks. However, recent studies have revealed that these models are vulnerable to backdoor attacks, where even a small number of malicious samples can successfully embed backdoor triggers into the model. While most existing defense methods focus on post-training backdoor defense, efficiently defending against backdoor attacks during training phase remains largely unexplored. To address this gap, we propose a novel defense method called Backdoor Token Unlearning (BTU), which proactively detects and neutralizes trigger tokens during the training stage. Our work is based on two key findings: 1) backdoor learning causes distinctive differences between backdoor token parameters and clean token parameters in word embedding layers, and 2) the success of backdoor attacks heavily depends on backdoor token parameters. The BTU defense leverages these properties to identify aberrant embedding parameters and subsequently removes backdoor behaviors using a fine-grained unlearning technique. Extensive evaluations across three datasets and four types of backdoor attacks demonstrate that BTU effectively defends against these threats while preserving the model's performance on primary tasks. Our code is available at https://github.com/XDJPH/BTU.
Chinese: 本文提出了一种名为后门令牌遗忘(BTU)的新型防御方法,通过在训练阶段主动检测并消除嵌入层中的异常令牌参数,并采用细粒度遗忘技术来有效防御后门攻击,同时保持模型主要任务的性能。
English: This paper introduces Backdoor Token Unlearning (BTU), a novel defense method that proactively detects and neutralizes backdoor triggers during supervised fine-tuning by identifying aberrant token parameters in embedding layers and applying fine-grained unlearning techniques.
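
The two findings suggest a simple detection recipe: flag embedding rows that moved abnormally far during fine-tuning and restore them. A sketch of that intuition; the z-score threshold is an illustrative choice, not the paper's exact criterion:

```python
import numpy as np

def patch_embeddings(pre, post, z_thresh=3.0):
    drift = np.linalg.norm(post - pre, axis=1)          # per-token update size
    z = (drift - drift.mean()) / (drift.std() + 1e-8)
    suspicious = np.where(z > z_thresh)[0]              # aberrant token ids
    patched = post.copy()
    patched[suspicious] = pre[suspicious]               # "unlearn" those rows
    return patched, suspicious

rng = np.random.default_rng(1)
pre = rng.standard_normal((1000, 64))                   # pre-fine-tuning embeddings
post = pre + 0.01 * rng.standard_normal((1000, 64))     # small benign drift
post[42] += 5.0                                         # simulated trigger token
_, flagged = patch_embeddings(pre, post)
print(flagged)  # -> [42]
```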

Authors:Jian Hu, Jason Klein Liu, Haotian Xu, Wei Shen
Title: REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
Abstract:
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT or GPT-4 commonly employ Proximal Policy Optimization (PPO), the inclusion of a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for responses to each prompt, which can lead to overfitting on simpler prompts, vulnerability to reward hacking, and biased estimates. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model while using an unbiased global advantage normalization to improve training stability. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
Chinese: REINFORCE++是一种创新的强化学习方法,它通过移除评论家网络并采用全局优势归一化来提高训练稳定性,在使大语言模型与人类偏好对齐方面展现出鲁棒性能和卓越的泛化能力。
English: REINFORCE++ is a novel reinforcement learning method that eliminates the critic network and employs global advantage normalization to enhance training stability, demonstrating robust performance and superior generalization in aligning large language models with human preferences.
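
The key departure from per-prompt baselines is that advantages are normalized across the whole batch. A toy sketch of that loss; the rewards and log-probabilities are placeholders for a reward model and policy:

```python
import torch

def reinforce_pp_loss(logprobs, rewards):
    """logprobs: summed log-probs of each sampled response (batch,);
    rewards: scalar reward per response (batch,)."""
    # Global normalization over the batch, not per-prompt groups.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    return -(logprobs * adv.detach()).mean()

logprobs = torch.tensor([-12.3, -8.1, -15.0, -9.4], requires_grad=True)
rewards = torch.tensor([0.9, 0.2, 0.7, 0.4])
loss = reinforce_pp_loss(logprobs, rewards)
loss.backward()
print(float(loss))
```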

Authors:Beichen Zhang, Yuhong Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Haodong Duan, Yuhang Cao, Dahua Lin, Jiaqi Wang
Title: BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Abstract:
Large language models (LLMs) have demonstrated impressive ability in solving complex mathematical problems with multi-step reasoning and can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information. We observe that while LLMs excel at decomposing mathematical problems, they often struggle with reasoning errors in fine-grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details. To address this issue, we propose BoostStep, a method that enhances reasoning accuracy through step-aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first-try" strategy to deliver exemplars highly relevant to the current state of reasoning. BoostStep is a flexible and powerful method that integrates seamlessly with chain-of-thought (CoT) and tree search algorithms, refining both candidate selection and decision-making. Empirical results show that BoostStep improves GPT-4o's CoT performance by 4.6% across mathematical benchmarks, significantly surpassing traditional few-shot learning's 1.2%. Moreover, it can achieve an additional 7.5% gain when combined with tree search. Surprisingly, it enables state-of-the-art LLMs to solve challenging math problems using simpler examples. It improves DeepSeek-R1-671B's performance on AIME by 2.2%, leveraging simple examples only from the MATH dataset.
Chinese: BoostStep通过步骤对齐的上下文学习机制和有效示范策略,显著提升大语言模型在数学推理中的准确性,在多项测试中取得突破性性能提升。
English: BoostStep enhances large language models' mathematical reasoning by aligning in-context learning examples with specific reasoning steps and using relevant exemplars, significantly improving accuracy across benchmarks.
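
Step-aligned retrieval can be contrasted with question-level retrieval in a few lines: fetch the reference step closest to the step currently being generated, rather than a whole worked example. The bag-of-words similarity below is a toy stand-in for the paper's retriever:

```python
# Invented bank of reference reasoning steps mined from worked solutions.
step_bank = [
    "Apply the quadratic formula to solve for x.",
    "Factor the polynomial by grouping.",
    "Substitute the boundary condition into the general solution.",
]

def similarity(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def retrieve_reference_step(current_step):
    """Return the reference step best aligned with the step being generated."""
    return max(step_bank, key=lambda s: similarity(s, current_step))

print(retrieve_reference_step("solve for x with the quadratic formula"))
```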

Authors:Libing Yuan, Shuaibo Hu, Kui Yu, Le Wu
Title: Boosting Explainability through Selective Rationalization in Pre-trained Language Models
Abstract:
The widespread application of pre-trained language models (PLMs) in natural language processing (NLP) has led to increasing concerns about their explainability. Selective rationalization is a self-explanatory framework that selects human-intelligible input subsets as rationales for predictions. Recent studies have shown that applying existing rationalization frameworks to PLMs will result in severe degeneration and failure problems, producing sub-optimal or meaningless rationales. Such failures severely damage trust in rationalization methods and constrain the application of rationalization techniques on PLMs. In this paper, we find that the homogeneity of tokens in the sentences produced by PLMs is the primary contributor to these problems. To address these challenges, we propose a method named Pre-trained Language Model's Rationalization (PLMR), which splits PLMs into a generator and a predictor to deal with NLP tasks while providing interpretable rationales. The generator in PLMR also alleviates homogeneity by pruning irrelevant tokens, while the predictor uses full-text information to standardize predictions. Experiments conducted on two widely used datasets across multiple PLMs demonstrate the effectiveness of the proposed method PLMR in addressing the challenge of applying selective rationalization to PLMs. Code: https://github.com/ylb777/PLMR.
Chinese: 预训练语言模型因标记同质性常产生不可靠解释,而PLMR方法通过分离生成与预测有效解决了这一问题,提供了可解释的合理化方案。
English: Pre-trained language models often produce unreliable rationales due to token homogeneity, but the proposed PLMR method effectively addresses this by separating generation and prediction to provide interpretable explanations.

Authors:Ali Al-Lawati, Jason Lucas, Prasenjit Mitra
Title: Semantic Captioning: Benchmark Dataset and Graph-Aware Few-Shot In-Context Learning for SQL2Text
Abstract:
Large Language Models (LLMs) have demonstrated remarkable performance in various NLP tasks, including semantic parsing, which translates natural language into formal code representations. However, the reverse process, translating code into natural language, termed semantic captioning, has received less attention. This task is becoming increasingly important as LLMs are integrated into platforms for code generation, security analysis, and educational purposes. In this paper, we focus on the captioning of SQL queries (SQL2Text) to address the critical need for understanding and explaining SQL queries in an era where LLM-generated code poses potential security risks. We repurpose Text2SQL datasets for SQL2Text by introducing an iterative ICL prompt using GPT-4o to generate multiple additional utterances, which enhances the robustness of the datasets for the reverse task. We conduct our experiments using in-context learning (ICL) based on different sample selection methods, emphasizing smaller, more computationally efficient LLMs. Our findings demonstrate that leveraging the inherent graph properties of SQL for ICL sample selection significantly outperforms random selection by up to 39% on BLEU score and provides better results than alternative methods. The dataset and code are published at: https://github.com/aliwister/ast-icl.
Chinese: 本研究针对将SQL查询转化为自然语言的语义描述任务,通过GPT-4o增强数据集并证明基于图结构的示例选择方法能显著提升小规模语言模型的性能表现,效果优于随机选择达39%。
English: This study addresses the semantic captioning task of translating SQL queries into natural language (SQL2Text) by repurposing datasets with GPT-4o-generated utterances and demonstrating that graph-based sample selection for in-context learning significantly outperforms random methods, especially in smaller LLMs.
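
One cheap way to exploit "the inherent graph properties of SQL" for demonstration selection is to compare structural features of queries. The keyword-set proxy below is a deliberately crude stand-in for an AST- or graph-based similarity:

```python
import re

def sql_features(sql):
    """Crude structural fingerprint: the set of SQL keywords a query uses."""
    tokens = re.findall(r"[A-Za-z_]+", sql.upper())
    keywords = {"SELECT", "FROM", "WHERE", "JOIN", "GROUP", "ORDER",
                "HAVING", "COUNT", "AVG"}
    return {t for t in tokens if t in keywords}

def jaccard(a, b):
    return len(a & b) / max(len(a | b), 1)

def select_demos(target_sql, pool, k=2):
    tf = sql_features(target_sql)
    return sorted(pool, key=lambda q: -jaccard(sql_features(q), tf))[:k]

pool = ["SELECT name FROM users WHERE age > 30",
        "SELECT COUNT(*) FROM orders GROUP BY day",
        "SELECT a.x FROM a JOIN b ON a.id = b.id"]
print(select_demos("SELECT AVG(price) FROM items GROUP BY category", pool))
```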

Authors:Dylan Bouchard, Mohit Singh Chauhan, David Skarbrevik, Viren Bajaj, Zeya Ahmad
Title: LangFair: A Python Package for Assessing Bias and Fairness in Large Language Model Use Cases
Abstract:
Large Language Models (LLMs) have been observed to exhibit bias in numerous ways, potentially creating or worsening outcomes for specific groups identified by protected attributes such as sex, race, sexual orientation, or age. To help address these risks, we introduce LangFair, an open-source Python package that aims to equip LLM practitioners with the tools to evaluate bias and fairness risks relevant to their specific use cases. The package offers functionality to easily generate evaluation datasets, composed of LLM responses to use-case-specific prompts, and subsequently calculate applicable metrics for the practitioner's use case. To guide metric selection, LangFair offers an actionable decision framework.
Chinese: LangFair 是一个开源 Python 工具包,旨在帮助大语言模型从业者通过生成评估数据集和计算相关指标来评估偏见与公平性风险,并提供可操作的指标选择框架。
English: LangFair is an open-source Python package designed to help LLM practitioners assess bias and fairness risks by enabling easy generation of evaluation datasets and calculation of relevant metrics, supported by a decision framework for metric selection.

Authors:Duygu Sezen Islakoglu, Jan-Christoph Kalo
Title: ChronoSense: Exploring Temporal Understanding in Large Language Models with Time Intervals of Events
Abstract:
Large Language Models (LLMs) have achieved remarkable success in various NLP tasks, yet they still face significant challenges in reasoning and arithmetic. Temporal reasoning, a critical component of natural language understanding, has raised increasing research attention. However, comprehensive testing of Allen's interval relations (e.g., before, after, during) -- a fundamental framework for temporal relationships -- remains underexplored. To fill this gap, we present ChronoSense, a new benchmark for evaluating LLMs' temporal understanding. It includes 16 tasks, focusing on identifying the Allen relation between two temporal events and temporal arithmetic, using both abstract events and real-world data from Wikidata. We assess the performance of seven recent LLMs using this benchmark and the results indicate that models handle Allen relations, even symmetrical ones, quite differently. Moreover, the findings suggest that the models may rely on memorization to answer time-related questions. Overall, the models' low performance highlights the need for improved temporal understanding in LLMs and ChronoSense offers a robust framework for future research in this area. Our dataset and the source code are available at https://github.com/duyguislakoglu/chronosense.
Chinese: 大型语言模型在时间推理方面存在不足,因此开发了ChronoSense基准测试,揭示其表现不稳定且依赖记忆,凸显了提升时间理解能力的必要性。
English: Large Language Models struggle with temporal reasoning, prompting the creation of ChronoSense, a benchmark that reveals their inconsistent performance and reliance on memorization, underscoring the need for enhanced temporal understanding.
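
Allen's interval algebra, the framework the benchmark tests, is fully deterministic and fits in one function, which makes the models' inconsistent handling of it notable. A direct implementation covering all thirteen relations:

```python
def allen_relation(a, b):
    """Classify the Allen relation between intervals a and b,
    each a (start, end) tuple with start < end."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:  return "before"
    if e2 < s1:  return "after"
    if e1 == s2: return "meets"
    if e2 == s1: return "met-by"
    if s1 == s2 and e1 == e2: return "equal"
    if s1 == s2: return "starts" if e1 < e2 else "started-by"
    if e1 == e2: return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2: return "during"
    if s1 < s2 and e2 < e1: return "contains"
    return "overlaps" if s1 < s2 else "overlapped-by"

print(allen_relation((1914, 1918), (1939, 1945)))  # before
print(allen_relation((1939, 1945), (1941, 1945)))  # finished-by
```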

Authors:Xiang Zheng, Longxiang Wang, Yi Liu, Xingjun Ma, Chao Shen, Cong Wang
Title: CALM: Curiosity-Driven Auditing for Large Language Models
Abstract:
Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, only to the provided service. We treat this type of auditing as a black-box optimization problem where the goal is to automatically uncover input-output pairs of the target LLMs that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input that the target LLM responds to with a toxic output, or an input that induces a hallucinatory response from the target LLM that mentions politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the large search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to finetune an LLM as the auditor agent to uncover potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names under the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.
Chinese: 本研究提出CALM方法,通过基于好奇心的强化学习训练审计代理,在无法获取内部参数的黑盒大语言模型中自动识别有害或有偏见的输入输出组合。
English: This study introduces CALM, a curiosity-driven auditing method using reinforcement learning to automatically detect harmful or biased input-output pairs in black-box large language models without accessing their internal parameters.

Authors:Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
Title: Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation
Abstract:
Multilingual neural machine translation (MNMT) aims for arbitrary translation across multiple languages. Although MNMT-specific models trained on parallel data offer low costs in training and deployment, their performance consistently lags behind that of large language models (LLMs). In this work, we introduce registering, a novel method that enables a small MNMT-specific model to compete with LLMs. Specifically, we insert a set of artificial tokens specifying the target language, called registers, into the input sequence between the source and target tokens. By modifying the attention mask, target token generation attends only to the activations of the registers, which represent the source tokens in the target language space. Experiments on EC-40, a large-scale benchmark, show that our method advances the state-of-the-art of MNMT. We further pre-train two models, namely MITRE (multilingual translation with registers), on 9.3 billion sentence pairs across 24 languages collected from public corpora. One of them, MITRE-913M, outperforms NLLB-3.3B, achieves comparable performance with commercial LLMs, and shows strong adaptability in fine-tuning. Finally, we open-source our models to facilitate further research and development in MNMT: https://github.com/zhiqu22/mitre.
Chinese Summary: 本研究提出“注册”方法,通过插入人工标记引导目标语言生成,使小型多语言神经机器翻译模型能够与大型语言模型竞争,并在EC-40基准测试中取得领先性能。
English Summary: This study introduces "registering," a method that enhances small multilingual neural machine translation (MNMT) models by inserting artificial tokens to guide target language generation, enabling them to compete with large language models (LLMs) and achieving state-of-the-art results on the EC-40 benchmark.
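A minimal sketch of the attention-mask side of registering, under the assumption that target tokens attend to the registers plus their own causal prefix; MITRE's exact masking convention may differ.

import torch

def build_register_mask(n_src, n_reg, n_tgt):
    # True = attention allowed. Source tokens attend among themselves,
    # registers attend to the source and to each other, and each target
    # token attends only to the registers plus its causal target prefix.
    n = n_src + n_reg + n_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_src, :n_src] = True
    mask[n_src:n_src + n_reg, :n_src + n_reg] = True
    t0 = n_src + n_reg
    for i in range(n_tgt):
        mask[t0 + i, n_src:t0 + i + 1] = True
    return mask

# Input layout: [source tokens | target-language registers | target tokens]
mask = build_register_mask(n_src=7, n_reg=2, n_tgt=5)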

Authors:Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang
Title: GeAR: Generation Augmented Retrieval
Abstract:
Document retrieval techniques are essential for developing large-scale information systems. The common approach involves using a bi-encoder to compute the semantic similarity between a query and documents. However, the scalar similarity often fails to reflect enough information, hindering the interpretation of retrieval results. In addition, this process primarily focuses on global semantics, overlooking the finer-grained semantic relationships between the query and the document's content. In this paper, we introduce a novel method, Generation Augmented Retrieval (GeAR), which not only improves the global document-query similarity through contrastive learning, but also integrates well-designed fusion and decoding modules. This enables GeAR to generate relevant context within the documents based on a given query, facilitating learning to retrieve local fine-grained information. Furthermore, when used as a retriever, GeAR does not incur any additional computational cost over bi-encoders. GeAR exhibits competitive retrieval performance across diverse scenarios and tasks. Moreover, qualitative analysis and the results generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released at https://github.com/microsoft/LMOps.
Chinese: 本文提出GeAR这一新颖文档检索方法,通过对比学习提升全局语义相似度,并生成细粒度上下文信息而不增加计算成本,在多种任务中展现出竞争优势。
English: This paper introduces GeAR, a novel document retrieval method that enhances global semantic similarity through contrastive learning and generates fine-grained contextual information without extra computational cost, demonstrating competitive performance across various tasks.
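For reference, the contrastive component GeAR builds on is the standard in-batch InfoNCE objective of bi-encoder retrievers; a compact PyTorch sketch (GeAR's fusion and decoding modules are omitted):

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, d_emb, temperature=0.05):
    # Standard bi-encoder InfoNCE: each query's positive is its own
    # document; the other documents in the batch act as negatives.
    q = F.normalize(q_emb, dim=-1)
    d = F.normalize(d_emb, dim=-1)
    logits = q @ d.T / temperature
    labels = torch.arange(q.shape[0])
    return F.cross_entropy(logits, labels)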

Authors:Yang Ouyang, Hengrui Gu, Shuhang Lin, Wenyue Hua, Jie Peng, Bhavya Kailkhura, Meijun Gao, Tianlong Chen, Kaixiong Zhou
Title: Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense
Abstract:
As large language models (LLMs) are increasingly deployed in diverse applications, including chatbot assistants and code generation, aligning their behavior with safety and ethical standards has become paramount. However, jailbreak attacks, which exploit vulnerabilities to elicit unintended or harmful outputs, threaten LLMs' safety significantly. In this paper, we introduce Layer-AdvPatcher, a novel methodology designed to defend against jailbreak attacks by utilizing an unlearning strategy to patch specific layers within LLMs through self-augmented datasets. Our insight is that certain layers tend to produce affirmative tokens when faced with harmful prompts. By identifying these layers and adversarially exposing them to generate more harmful data, one can understand their inherent and diverse vulnerabilities to attacks. With these exposures, we then "unlearn" these issues, reducing the impact of affirmative tokens and hence minimizing jailbreak risks while keeping the model's responses to safe queries intact. We conduct extensive experiments on two models, four benchmark datasets, and multiple state-of-the-art jailbreak attacks to demonstrate the efficacy of our approach. Results indicate that our framework reduces the harmfulness and attack success rate of jailbreak attacks without compromising utility for benign queries compared to recent defense methods. Our code is publicly available at: https://github.com/oyy2000/LayerAdvPatcher
中文: Layer-AdvPatcher通过逆向学习修补大语言模型中的脆弱层,在保持良性查询功能的同时显著降低了越狱攻击的成功率和危害性。
English: Layer-AdvPatcher is a defense method that patches vulnerable layers in LLMs through adversarial unlearning, effectively reducing jailbreak risks while maintaining model utility for safe queries.
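A rough logit-lens-style probe for locating layers that favor affirmative tokens, in the spirit of the self-exposure step; the probe design and token list here are illustrative assumptions, not the paper's exact procedure.

import torch

def affirmative_layer_scores(hidden_states, unembed, affirm_ids):
    # Project each layer's hidden state (for a harmful prompt) through
    # the unembedding matrix and sum the probability mass placed on
    # affirmative tokens (e.g., "Sure", "Yes"); layers with high scores
    # are candidates for adversarial exposure and patching.
    scores = []
    for h in hidden_states:                          # each h: (d,)
        probs = torch.softmax(unembed @ h, dim=-1)   # unembed: (V, d)
        scores.append(probs[affirm_ids].sum().item())
    return scores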

Authors:Jiaping Wang, Simiao Zhang, Qiao-Chu He, Yifan Chen
Title: LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations
Abstract:
The machine learning and data science community has made significant yet dispersed progress in accelerating transformer-based large language models (LLMs), and one promising approach is to replace the original causal attention in a generative pre-trained transformer (GPT) with exponentially decaying causal linear attention. In this paper, we present LeetDecoding, the first Python package that provides a large set of computation routines for this fundamental operator. The launch of LeetDecoding was motivated by the current lack of (1) clear understanding of the complexity of this operator, (2) a comprehensive collection of existing computation methods (usually spread across seemingly unrelated fields), and (3) CUDA implementations for fast inference on GPU. LeetDecoding's design is easy to integrate with existing linear-attention LLMs, and allows researchers to benchmark and evaluate new computation methods for exponentially decaying causal linear attention. The usage of LeetDecoding does not require any knowledge of GPU programming or the underlying complexity analysis, intentionally making LeetDecoding accessible to LLM practitioners. The source code of LeetDecoding is provided at https://github.com/Computational-Machine-Intelligence/LeetDecoding, and users can simply install LeetDecoding with the command pip install leet-decoding.
中文: LeetDecoding是首个提供指数衰减因果线性注意力计算功能的Python工具包,旨在解决该算子理解不足、方法分散和CUDA实现缺失的问题,支持无缝集成与性能评估,且无需GPU编程知识即可使用。
English: LeetDecoding is the first Python package offering comprehensive computation routines for exponentially decaying causal linear attention in transformers, addressing the lack of understanding, method collection, and CUDA implementations while enabling easy integration and benchmarking without requiring GPU programming expertise.
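The operator itself admits a simple O(n) recurrence; a naive PyTorch baseline (nothing like LeetDecoding's optimized CUDA routines, but it defines the computation):

import torch

def decaying_causal_linear_attention(q, k, v, lam=0.99):
    # Naive recurrence for the operator:
    #   state_t = lam * state_{t-1} + k_t v_t^T,   o_t = q_t^T state_t
    # q, k: (n, d); v: (n, d_v); lam: decay factor in (0, 1].
    n, d = q.shape
    state = torch.zeros(d, v.shape[1])
    out = torch.empty(n, v.shape[1])
    for t in range(n):
        state = lam * state + torch.outer(k[t], v[t])  # decayed KV accumulator
        out[t] = q[t] @ state
    return out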

Authors:Jaeyoung Kim, Jongho Lee, Hong-Jun Choi, Ting-Yao Hsu, Chieh-Yang Huang, Sungchul Kim, Ryan Rossi, Tong Yu, Clyde Lee Giles, Ting-Hao 'Kenneth' Huang, Sungchul Choi
Title: Multi-LLM Collaborative Caption Generation in Scientific Documents
Abstract:
Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as either an image-to-text or text summarization problem. This limitation hinders the generation of high-quality captions that fully capture the necessary details. Moreover, existing data sourced from arXiv papers contain low-quality captions, posing significant challenges for training large language models (LLMs). In this paper, we introduce a framework called Multi-LLM Collaborative Figure Caption Generation (MLBCAP) to address these challenges by leveraging specialized LLMs for distinct sub-tasks. Our approach unfolds in three key modules: (Quality Assessment) We utilize multimodal LLMs to assess the quality of training data, enabling the filtration of low-quality captions. (Diverse Caption Generation) We then employ a strategy of fine-tuning/prompting multiple LLMs on the captioning task to generate candidate captions. (Judgment) Lastly, we prompt a prominent LLM to select the highest quality caption from the candidates, followed by refining any remaining inaccuracies. Human evaluations demonstrate that informative captions produced by our approach rank better than human-written captions, highlighting its effectiveness. Our code is available at https://github.com/teamreboott/MLBCAP
中文摘要:MLBCAP框架通过协同多个专用大语言模型,实现了训练数据质量评估、多样化描述生成与最优描述筛选的三步优化,在科学图表标注任务中生成比人工撰写更优质的结果。
English Summary: The MLBCAP framework enhances scientific figure captioning by employing multiple specialized LLMs to filter low-quality data, generate diverse candidate captions, and select the most accurate description, outperforming human-written captions in evaluations.

Authors:Zhe Chen, Yusheng Liao, Shuyang Jiang, Pingjie Wang, Yiqiu Guo, Yanfeng Wang, Yu Wang
Title: Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications
Abstract:
Large language models hold promise for addressing medical challenges, such as medical diagnosis reasoning, research knowledge acquisition, clinical decision-making, and consumer health inquiry support. However, they often generate hallucinations due to limited medical knowledge. Incorporating external knowledge is therefore critical, which necessitates multi-source knowledge acquisition. We address this challenge by framing it as a source planning problem, which is to formulate context-appropriate queries tailored to the attributes of diverse sources. Existing approaches either overlook source planning or fail to achieve it effectively due to misalignment between the model's expectation of the sources and their actual content. To bridge this gap, we present MedOmniKB, a repository comprising multigenre and multi-structured medical knowledge sources. Leveraging these sources, we propose the Source Planning Optimisation method, which enhances multi-source utilisation. Our approach involves enabling an expert model to explore and evaluate potential plans while training a smaller model to learn source alignment. Experimental results demonstrate that our method substantially improves multi-source planning performance, enabling the optimised small model to achieve state-of-the-art results in leveraging diverse medical knowledge sources.
中文摘要:大语言模型虽能处理医疗任务,但因知识有限常产生错误,为此我们构建了MedOmniKB知识库并提出源规划优化方法,显著提升了多源知识利用效率,在医疗应用中取得了最优效果。
English Summary: Large language models can tackle medical tasks but often produce errors due to limited knowledge, so we developed MedOmniKB and a source planning optimization method to improve multi-source knowledge integration, achieving top performance in medical applications.

Authors:Binh-Nguyen Nguyen, Yang He
Title: Swift Cross-Dataset Pruning: Enhancing Fine-Tuning Efficiency in Natural Language Understanding
Abstract:
Dataset pruning aims to select a subset of a dataset for efficient model training. While data efficiency in natural language processing has primarily focused on within-corpus scenarios during model pre-training, efficient dataset pruning for task-specific fine-tuning across diverse datasets remains challenging due to variability in dataset sizes, data distributions, class imbalance and label spaces. Current cross-dataset pruning techniques for fine-tuning often rely on computationally expensive sample ranking processes, typically requiring full dataset training or reference models. We address this gap by proposing Swift Cross-Dataset Pruning (SCDP). Specifically, our approach uses TF-IDF embeddings with geometric median to rapidly evaluate sample importance. We then apply dataset size-adaptive pruning to ensure diversity: for smaller datasets, we retain samples far from the geometric median, while for larger ones, we employ distance-based stratified pruning. Experimental results on six diverse datasets demonstrate the effectiveness of our method, spanning various tasks and scales while significantly reducing computational resources. Source code is available at: https://github.com/he-y/NLP-Dataset-Pruning
中文: 本文提出Swift跨数据集剪枝方法,利用TF-IDF嵌入和几何中位数快速评估样本重要性,根据数据集规模自适应调整剪枝策略以保持多样性,在显著降低计算资源的同时在六个不同数据集上验证了有效性。
English: This paper introduces Swift Cross-Dataset Pruning (SCDP), a method that uses TF-IDF embeddings and geometric median calculations to efficiently prune datasets for task-specific fine-tuning, adapting strategies based on dataset size to maintain diversity while reducing computational costs.
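A rough Python sketch of the small-dataset branch, assuming a Weiszfeld iteration for the geometric median; the large-dataset stratified variant and SCDP's exact scoring details are omitted.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def geometric_median(X, iters=50, eps=1e-8):
    # Weiszfeld's algorithm for the geometric median of the rows of X.
    y = X.mean(axis=0)
    for _ in range(iters):
        d = np.maximum(np.linalg.norm(X - y, axis=1), eps)
        w = 1.0 / d
        y = (w[:, None] * X).sum(axis=0) / w.sum()
    return y

def prune_small_dataset(texts, keep_ratio=0.5):
    # Small-dataset rule: keep the samples farthest from the geometric
    # median of the TF-IDF embeddings to preserve diversity.
    X = TfidfVectorizer(max_features=2048).fit_transform(texts).toarray()
    dist = np.linalg.norm(X - geometric_median(X), axis=1)
    keep = np.argsort(dist)[-int(len(texts) * keep_ratio):]
    return [texts[i] for i in sorted(keep)]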

Authors:Tara Radvand, Mojtaba Abdolmaleki, Mohamed Mostagir, Ambuj Tewari
Title: Zero-Shot Statistical Tests for LLM-Generated Text Detection using Finite Sample Concentration Inequalities
Abstract:
Verifying the provenance of content is crucial to the function of many organizations, e.g., educational institutions, social media platforms, firms, etc. This problem is becoming increasingly challenging as text generated by Large Language Models (LLMs) becomes almost indistinguishable from human-generated content. In addition, many institutions utilize in-house LLMs and want to ensure that external, non-sanctioned LLMs do not produce content within the institution. In this paper, we answer the following question: Given a piece of text, can we identify whether it was produced by a particular LLM or not? We model LLM-generated text as a sequential stochastic process with complete dependence on history. We then design zero-shot statistical tests to (i) distinguish between text generated by two different known sets of LLMs A (non-sanctioned) and B (in-house), and (ii) identify whether text was generated by a known LLM or generated by any unknown model, e.g., a human or some other language generation process. We prove that the type I and type II errors of our test decrease exponentially with the length of the text. For that, we show that if B generates the text, then except with an exponentially small probability in string length, the log-perplexity of the string under A converges to the average cross-entropy of B and A. We then present experiments using LLMs with white-box access to support our theoretical results and empirically examine the robustness of our results to black-box settings and adversarial attacks. In the black-box setting, our method achieves an average TPR of 82.5% at a fixed FPR of 5%. Under adversarial perturbations, our minimum TPR is 48.6% at the same FPR threshold. Both results outperform all non-commercial baselines. See https://github.com/TaraRadvand74/llm-text-detection for code, data, and an online demo of the project.
Chinese: 本文提出了一种通过将文本建模为随机过程并使用零样本统计测试来验证其是否由特定大型语言模型生成的方法,该方法在理论和实验中均表现出色,错误率随文本长度呈指数级下降,并在白盒和黑盒环境下均优于现有基线。
English: This paper introduces a method to verify whether text is generated by a specific large language model by modeling it as a stochastic process and using zero-shot statistical tests, with proven exponential error reduction and strong performance in both white-box and black-box settings.
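Schematically, the test reduces to thresholding the log-perplexity of the string under model A; a minimal sketch, where the per-token log-probabilities and the threshold tau are supplied by the caller:

def log_perplexity(token_logprobs_under_A):
    # Average negative log-likelihood per token under model A.
    return -sum(token_logprobs_under_A) / len(token_logprobs_under_A)

def attribute_to_A(token_logprobs_under_A, tau):
    # If A generated the text, its log-perplexity under A concentrates
    # near A's entropy rate; if B generated it, it concentrates near the
    # (larger) B-to-A cross-entropy. A threshold tau placed between the
    # two separates them, with both error types vanishing exponentially
    # in string length.
    return log_perplexity(token_logprobs_under_A) <= tau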

Authors:Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, Guangyao Shi
Title: A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges
Abstract:
Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP, Claude, and GPT-4V demonstrate strong reasoning and understanding abilities on visual and textual data and beat classical single modality vision models on zero-shot classification [93]. With their rapid advancements in research and growing popularity in various applications, we provide a comprehensive survey of VLMs. Specifically, we provide a systematic overview of VLMs in the following aspects: [1] model information of the major VLMs developed up to 2025; [2] the transition of VLM architectures and the newest VLM alignment methods; [3] summary and categorization of the popular benchmarks and evaluation metrics of VLMs; [4] the challenges and issues faced by current VLMs such as hallucination, alignment, fairness, and safety. Detailed collections including papers and model repository links are listed in https://github.com/zli12321/Vision-Language-Models-Overview.
中文: 多模态视觉语言模型融合计算机视觉与自然语言处理,使机器能通过视觉和文本模态感知与推理,本综述系统梳理了其发展历程、架构、评估基准及幻觉与安全等挑战。
English: Multimodal Vision Language Models (VLMs) integrate computer vision and natural language processing to enable machines to perceive and reason through visual and textual data, with this survey systematically reviewing their development, architectures, benchmarks, and challenges like hallucination and safety.

Authors:Tianyu Gao, Alexander Wettig, Luxi He, Yihe Dong, Sadhika Malladi, Danqi Chen
Title: Metadata Conditioning Accelerates Language Model Pre-training
Abstract:
The vast diversity of styles, domains, and quality levels present in language model pre-training corpora is essential in developing general model capabilities, but efficiently learning and deploying the correct behaviors exemplified in each of these heterogeneous data sources is challenging. To address this, we propose a new method, termed Metadata Conditioning then Cooldown (MeCo), to incorporate additional learning cues during pre-training. MeCo first provides metadata (e.g., URLs like www.wikipedia.org) alongside the text during training and later uses a cooldown phase with only the standard text, thereby enabling the model to function normally even without metadata. MeCo significantly accelerates pre-training across different model scales (600M to 8B parameters) and training sources (C4, RefinedWeb, and DCLM). For instance, a 1.6B language model trained with MeCo matches the downstream task performance of standard pre-training while using 33% less data. Additionally, MeCo enables us to steer language models by conditioning the inference prompt on either real or fabricated metadata that encodes the desired properties of the output: for example, prepending wikipedia.org to reduce harmful generations or factquizmaster.com (fabricated) to improve common knowledge task performance. We also demonstrate that MeCo is compatible with different types of metadata, such as model-generated topics. MeCo is remarkably simple, adds no computational overhead, and demonstrates promise in producing more capable and steerable language models.
中文:MeCo方法通过在预训练中先结合元数据与文本再进行纯文本冷却阶段,显著提升了训练效率并实现了无需额外计算开销的模型引导能力。
English: The MeCo method enhances pre-training by initially using metadata alongside text and then transitioning to text-only training, significantly improving efficiency and enabling model steering without added computational cost.
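A minimal sketch of the data-formatting side of MeCo (the serialization layout is an assumption; the paper's exact format may differ):

def format_pretraining_example(text, metadata, in_cooldown):
    # Main phase: metadata (e.g., the source URL) precedes the document.
    # Cooldown phase: plain text only, so the trained model also works
    # without metadata at inference time.
    return text if in_cooldown else f"{metadata}\n\n{text}"

# Steering at inference: prepend real or fabricated metadata.
prompt = format_pretraining_example(
    "The capital of France is", "wikipedia.org", in_cooldown=False)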

Authors:Aobo Kong, Wentao Ma, Shiwan Zhao, Yongbin Li, Yuchuan Wu, Ke Wang, Xiaoqian Liu, Qicheng Li, Yong Qin, Fei Huang
Title: SDPO: Segment-Level Direct Preference Optimization for Social Agents
Abstract:
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. Direct Preference Optimization (DPO) has proven effective in aligning LLM behavior with human preferences across various agent tasks. However, standard DPO focuses solely on individual turns, which limits its effectiveness in multi-turn social interactions. Several DPO-based multi-turn alignment methods with session-level data have shown potential in addressing this problem. While these methods consider multiple turns across entire sessions, they are often overly coarse-grained, introducing training noise, and lack robust theoretical support. To resolve these limitations, we propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior. SDPO minimizes training noise and is grounded in a rigorous theoretical framework. Evaluations on the SOTOPIA benchmark demonstrate that SDPO-tuned agents consistently outperform both existing DPO-based methods and proprietary LLMs like GPT-4o, underscoring SDPO's potential to advance the social intelligence of LLM-based agents. We release our code and data at https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/SDPO.
Chinese Summary: 提出的分段级直接偏好优化(SDPO)方法通过动态选择关键交互片段来优化多轮智能体行为,在SOTOPIA基准测试中优于现有方法和GPT-4o,同时最大程度减少了训练噪声。
English Summary: The proposed Segment-Level Direct Preference Optimization (SDPO) method dynamically selects key interaction segments to optimize multi-turn agent behavior, outperforming existing methods and GPT-4o on the SOTOPIA benchmark while minimizing training noise.
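A condensed sketch of a segment-restricted DPO loss, assuming per-token log-probabilities and binary segment masks are already computed; this illustrates the idea rather than SDPO's exact objective.

import torch.nn.functional as F

def segment_dpo_loss(logp_w, logp_l, ref_w, ref_l, mask_w, mask_l, beta=0.1):
    # Standard DPO margin, but summed only over tokens inside the
    # selected key segments. logp_*/ref_*: per-token log-prob tensors
    # under the policy/reference models; mask_*: 1.0 in-segment, else 0.0.
    pi_w = (logp_w * mask_w).sum()
    pi_l = (logp_l * mask_l).sum()
    r_w = (ref_w * mask_w).sum()
    r_l = (ref_l * mask_l).sum()
    margin = beta * ((pi_w - r_w) - (pi_l - r_l))
    return -F.logsigmoid(margin)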

Authors:Nouran Khallaf, Carlo Eugeni, Serge Sharoff
Title: Reading Between the Lines: A dataset and a study on why some texts are tougher than others
Abstract:
Our research aims at better understanding what makes a text difficult to read for specific audiences with intellectual disabilities, more specifically, people who have limitations in cognitive functioning, such as reading and understanding skills, an IQ below 70, and challenges in conceptual domains. We introduce a scheme for the annotation of difficulties which is based on empirical research in psychology as well as on research in translation studies. The paper describes the annotated dataset, primarily derived from the parallel texts (standard English and Easy to Read English translations) made available online. We fine-tuned four different pre-trained transformer models to perform the task of multiclass classification to predict the strategies required for simplification. We also investigate the possibility of interpreting the decisions of this language model when it is aimed at predicting the difficulty of sentences. The resources are available from https://github.com/Nouran-Khallaf/why-tough
中文摘要:本研究针对智力障碍读者的文本难度因素,结合心理学与翻译研究开发了标注方案,并利用Transformer模型预测简化策略,同时探索模型决策机制的可解释性。
English Summary: This study investigates text difficulty factors for intellectually disabled readers by developing an annotation scheme based on psychological and translation research, and employs transformer models to predict simplification strategies while interpreting their decision-making processes.

Authors:Bohan Zhang, Xiaokang Zhang, Jing Zhang, Jifan Yu, Sijia Luo, Jie Tang
Title: CoT-based Synthesizer: Enhancing LLM Performance through Answer Synthesis
Abstract:
Current inference scaling methods, such as Self-consistency and Best-of-N, have proven effective in improving the accuracy of LLMs on complex reasoning tasks. However, these methods rely heavily on the quality of candidate responses and are unable to produce correct answers when all candidates are incorrect. In this paper, we propose a novel inference scaling strategy, CoT-based Synthesizer, which leverages CoT reasoning to synthesize superior answers by analyzing complementary information from multiple candidate responses, even when all candidate responses are flawed. To enable a lightweight and cost-effective implementation, we introduce an automated data generation pipeline that creates diverse training data. This allows smaller LLMs trained on this data to improve the inference accuracy of larger models, including API-based LLMs. Experimental results across four benchmark datasets with seven policy models demonstrate that our method significantly enhances performance, with gains of 11.8% for Llama3-8B and 10.3% for GPT-4o on the MATH dataset. The corresponding training data and code are publicly available on https://github.com/RUCKBReasoning/CoT-based-Synthesizer.
Chinese: 本文提出了一种基于思维链的合成器新策略,通过分析多个候选回答的互补信息来合成更优答案,即使在所有候选答案均有缺陷时也能提升大语言模型的推理准确率,实验证明该方法在多个基准数据集上显著提升了模型性能。
English: This paper introduces a CoT-based Synthesizer, a novel inference scaling strategy that enhances LLM accuracy by synthesizing superior answers from flawed candidate responses using complementary reasoning, with experiments showing significant performance gains on benchmark datasets.
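A minimal sketch of how the synthesis prompt might be assembled (the wording is illustrative; the released templates are in the repository):

def build_synthesis_prompt(question, candidates):
    # The synthesizer sees every candidate, cross-checks them step by
    # step, and may output an answer that differs from all of them.
    listing = "\n".join(
        f"Candidate {i + 1}: {c}" for i, c in enumerate(candidates))
    return (f"Question: {question}\n{listing}\n"
            "Analyze the candidates step by step, reconcile their "
            "differences, and state the best final answer.")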

Authors:Yin Cai, Zhouhong Gu, Zhaohan Du, Zheyu Ye, Shaosheng Cao, Yiqian Xu, Hongwei Feng, Ping Chen
Title: MIRAGE: Exploring How Large Language Models Perform in Complex Social Interactive Environments
Abstract:
Large Language Models (LLMs) have shown remarkable capabilities in environmental perception, reasoning-based decision-making, and simulating complex human behaviors, particularly in interactive role-playing contexts. This paper introduces the Multiverse Interactive Role-play Ability General Evaluation (MIRAGE), a comprehensive framework designed to assess LLMs' proficiency in portraying advanced human behaviors through murder mystery games. MIRAGE features eight intricately crafted scripts encompassing diverse themes and styles, providing a rich simulation. To evaluate LLMs' performance, MIRAGE employs four distinct methods: the Trust Inclination Index (TII) to measure dynamics of trust and suspicion, the Clue Investigation Capability (CIC) to measure LLMs' capability of investigating clues, the Interactivity Capability Index (ICI) to assess role-playing capabilities, and the Script Compliance Index (SCI) to assess LLMs' capability of understanding and following instructions. Our experiments indicate that even popular models like GPT-4 face significant challenges in navigating the complexities presented by MIRAGE. The datasets and simulation codes are available at https://github.com/lime728/MIRAGE.
中文摘要:本文提出MIRAGE框架,通过谋杀之谜游戏评估大语言模型模拟复杂人类行为的能力,发现即使是GPT-4等先进模型在复杂角色扮演场景中仍面临显著挑战。
English Summary: This paper introduces the MIRAGE framework to evaluate large language models' ability to simulate complex human behaviors through murder mystery games, finding that even advanced models like GPT-4 struggle with the sophisticated role-playing scenarios.

Authors:Tien Dang, Viet Thanh Duy Nguyen, Minh Tuan Le, Truong-Son Hy
Title: Multimodal Contrastive Representation Learning in Augmented Biomedical Knowledge Graphs
Abstract:
Biomedical Knowledge Graphs (BKGs) integrate diverse datasets to elucidate complex relationships within the biomedical field. Effective link prediction on these graphs can uncover valuable connections, such as potential novel drug-disease relations. We introduce a novel multimodal approach that unifies embeddings from specialized Language Models (LMs) with Graph Contrastive Learning (GCL) to enhance intra-entity relationships while employing a Knowledge Graph Embedding (KGE) model to capture inter-entity relationships for effective link prediction. To address limitations in existing BKGs, we present PrimeKG++, an enriched knowledge graph incorporating multimodal data, including biological sequences and textual descriptions for each entity type. By combining semantic and relational information in a unified representation, our approach demonstrates strong generalizability, enabling accurate link predictions even for unseen nodes. Experimental results on PrimeKG++ and the DrugBank drug-target interaction dataset demonstrate the effectiveness and robustness of our method across diverse biomedical datasets. Our source code, pre-trained models, and data are publicly available at https://github.com/HySonLab/BioMedKG
Chinese: 我们提出的多模态方法将语言模型嵌入与图对比学习和知识图谱嵌入相结合,显著提升了生物医学链接预测效果,在PrimeKG++和DrugBank等丰富知识图谱数据集上表现出卓越性能。
English: Our novel multimodal approach integrates language model embeddings with graph contrastive learning and knowledge graph embeddings to enhance biomedical link prediction, demonstrating strong performance on enriched knowledge graphs like PrimeKG++ and DrugBank datasets.

Authors:Xize Cheng, Dongjie Fu, Xiaoda Yang, Minghui Fang, Ruofan Hu, Jingyu Lu, Bai Jionghao, Zehan Wang, Shengpeng Ji, Rongjie Huang, Linjun Li, Yu Chen, Tao Jin, Zhou Zhao
Title: OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios
Abstract:
With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduce ShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduce OmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at https://sharechatx.github.io/.
中文: 本文提出了首个大规模合成口语对话数据集ShareChatX和多轮对话系统OmniChat,通过优化合成与真实数据的配比,在包含音频和音乐等复杂场景的对话任务中取得了最优性能。
English: This paper introduces ShareChatX, a large-scale synthetic spoken dialogue dataset, and OmniChat, a multi-turn dialogue system that achieves state-of-the-art performance by optimally integrating synthetic and real data to handle complex scenarios like audio events and emotional expressions.

Authors:Yong Zhao, Yang Deng, See-Kiong Ng, Tat-Seng Chua
Title: Aligning Large Language Models for Faithful Integrity Against Opposing Argument
Abstract:
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. However, they can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. To this end, we investigate the problem of maintaining faithful integrity in LLMs. This involves ensuring that LLMs adhere to their faithful statements in the face of opposing arguments and are able to correct their incorrect statements when presented with faithful arguments. In this work, we propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation (AFICE), which aims to align the LLM responses with faithful integrity. Specifically, AFICE first designs a Bilateral Confidence Estimation (BCE) approach for estimating the uncertainty of each response generated by the LLM given a specific context, which simultaneously estimates the model's confidence in the question, based on the internal states during decoding, and in the answer, based on cumulative probability ratios. With the BCE, we construct a conversational preference dataset composed of context, original statement, and argument, which is adopted for aligning the LLM for faithful integrity using Direct Preference Optimization (DPO). Extensive experimental results on a wide range of benchmarks demonstrate significant improvements in the LLM's ability to maintain faithful responses when encountering opposing arguments, ensuring both the practical utility and trustworthiness of LLMs in complex interactive settings. Code and data will be released via https://github.com/zhaoy777/AFICE.git
中文摘要:AFICE框架通过双边置信度估计和直接偏好优化,增强大语言模型在遇到对立论点时保持忠实完整性的能力,确保其回应的可靠性和一致性。
English Summary: The AFICE framework enhances large language models' ability to maintain faithful integrity by using bilateral confidence estimation and direct preference optimization to ensure consistent responses despite opposing arguments.

Authors:Xiaoshuai Song, Yanan Wu, Weixun Wang, Jiaheng Liu, Wenbo Su, Bo Zheng
Title: ProgCo: Program Helps Self-Correction of Large Language Models
Abstract:
Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further misleading refinement and leading to the failure of self-correction, especially in complex reasoning tasks. In this paper, we propose Program-driven Self-Correction (ProgCo). First, program-driven verification (ProgVe) achieves complex verification logic and extensive validation through self-generated, self-executing verification pseudo-programs. Then, program-driven refinement (ProgRe) receives feedback from ProgVe and conducts dual reflection and refinement on both responses and verification programs to mitigate the misleading effect of incorrect feedback in complex reasoning tasks. Experiments on three instruction-following and mathematical benchmarks indicate that ProgCo achieves effective self-correction and can further enhance performance when combined with real program tools. We release our code at https://github.com/songxiaoshuai/progco.
Chinese: 大语言模型的自校正通过ProgCo得以增强,它利用自生成的验证伪程序来验证和优化响应,从而在复杂推理任务中提升性能。
English: Self-correction in large language models is enhanced by ProgCo, which uses self-generated verification pseudo-programs to verify and refine responses, improving performance in complex reasoning tasks.

Authors:Haina Zhu, Yizhi Zhou, Hangting Chen, Jianwei Yu, Ziyang Ma, Rongzhi Gu, Yi Luo, Wei Tan, Xie Chen
Title: MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization
Abstract:
Recent years have witnessed the success of foundation models pre-trained with self-supervised learning (SSL) in various music informatics understanding tasks, including music tagging, instrument classification, key detection, and more. In this paper, we propose a self-supervised music representation learning model for music understanding. Distinguished from previous studies adopting random projection or existing neural codecs, the proposed model, named MuQ, is trained to predict tokens generated by Mel Residual Vector Quantization (Mel-RVQ). Our Mel-RVQ utilizes a residual linear projection structure for Mel-spectrum quantization to enhance the stability and efficiency of target extraction, leading to better performance. Experiments on a large variety of downstream tasks demonstrate that MuQ outperforms previous self-supervised music representation models with only 0.9K hours of open-source pre-training data. Scaling up the data to over 160K hours and adopting iterative training consistently improve the model performance. To further validate the strength of our model, we present MuQ-MuLan, a joint music-text embedding model based on contrastive learning, which achieves state-of-the-art performance in the zero-shot music tagging task on the MagnaTagATune dataset. Code and checkpoints are open-sourced at https://github.com/tencent-ailab/MuQ.
Chinese: MuQ模型采用梅尔残差向量量化进行自监督音乐表示学习,仅用少量预训练数据即在多项任务中表现卓越,并能通过数据扩展持续提升性能。
English: The MuQ model introduces a self-supervised music representation learning approach using Mel Residual Vector Quantization, achieving superior performance across multiple tasks with minimal pre-training data and scaling effectively to larger datasets.
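A bare-bones PyTorch sketch of residual vector quantization over mel frames, the idea behind Mel-RVQ; the random codebooks and dimensions here are placeholders, and MuQ's residual linear projections are omitted.

import torch

def rvq_encode(mel_frames, codebooks):
    # Residual VQ: each stage quantizes the residual left by the
    # previous stage. mel_frames: (n, d); each codebook: (K, d).
    residual, codes = mel_frames, []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)  # nearest codeword
        codes.append(idx)
        residual = residual - cb[idx]                  # pass on the residual
    return torch.stack(codes, dim=1)  # (n, n_stages) discrete SSL targets

codebooks = [torch.randn(256, 80) for _ in range(4)]   # 4 stages, 80-dim mel
tokens = rvq_encode(torch.randn(100, 80), codebooks)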

Authors:Md Osama, Ashim Dey, Kawsar Ahmed, Muhammad Ashad Kabir
Title: BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion
Abstract:
Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at https://github.com/akabircs/BeliN.
Chinese: 本研究提出了一种新颖的上下文多输入特征融合方法MultiGen,用于孟加拉语宗教新闻标题生成,通过整合情感、类别和方面特征,在BLEU和ROUGE-L得分上优于仅基于内容的方法。
English: This study introduces a novel contextual multi-input feature fusion approach, MultiGen, for Bengali religious news headline generation, which outperforms content-only methods by integrating sentiment, category, and aspect features, achieving superior BLEU and ROUGE-L scores.

Authors:Youngjun Son, Chaewon Kim, Jaejin Lee
Title: FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration
Abstract:
Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of large language models. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. FED significantly outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times when processing 30 million documents on a node with four GPUs. Notably, our method dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speed-ups of up to 260x compared to the CPU baseline. Despite these gains in efficiency, FED maintains high deduplication quality, with the duplicate document sets reaching a Jaccard similarity of over 0.96 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
中文: FED框架通过GPU优化的MinHash LSH和高效哈希函数显著加速数据集去重,在保持高质量去重的同时,处理速度比CPU工具快107.2倍,比GPU方案快6.3倍。
English: The FED framework significantly accelerates dataset deduplication using GPU-optimized MinHash LSH and efficient hash functions, achieving up to 107.2x speed over CPU tools and 6.3x over GPU alternatives while maintaining high deduplication quality.
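For context, a tiny pure-Python rendering of the MinHash signature step that FED accelerates; Python's built-in hash stands in for FED's partially reusable non-cryptographic hash functions.

def shingles(text, k=5):
    # Character k-shingles of a document.
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash_signature(text, num_perm=64):
    # One minimum per seeded hash; near-duplicate documents agree on
    # many signature slots, which LSH then bands into candidate buckets
    # for pairwise similarity checks.
    return [min(hash((seed, s)) for s in shingles(text))
            for seed in range(num_perm)]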

Authors:Bin Wang, Xunlong Zou, Shuo Sun, Wenyu Zhang, Yingxu He, Zhuohan Liu, Chengwei Wei, Nancy F. Chen, AiTi Aw
Title: Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models
Abstract:
Singlish, a Creole language rooted in English, is a key focus in linguistic research within multilingual and multicultural contexts. However, its spoken form remains underexplored, limiting insights into its linguistic structure and applications. To address this gap, we standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS), and Paralinguistic Question Answering (PQA). We release standardized splits and a human-verified test set to facilitate further research. Additionally, we propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently. Experiments reveal our model's adaptability to the Singlish context, achieving state-of-the-art performance and outperforming prior AudioLLMs and cascaded solutions by 10-30%.
中文摘要:本研究推出了最大的标准化标注口语新加坡英语语料库——多任务国家语音语料库(MNSC),并提出了多任务多模态模型SingAudioLLM,该模型在多项语音处理任务中表现优异,性能超越先前模型10-30%,达到当前最优水平。
English Summary: This study introduces the largest standardized and annotated spoken Singlish corpus, the Multitask National Speech Corpus (MNSC), along with a multi-task multimodal model called SingAudioLLM, which achieves state-of-the-art performance by outperforming previous models by 10-30% across various speech processing tasks.

Authors:Wenqi Zhang, Hang Zhang, Xin Li, Jiashuo Sun, Yongliang Shen, Weiming Lu, Deli Zhao, Yueting Zhuang, Lidong Bing
Title: 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
Abstract:
Compared to image-text pair data, interleaved corpora enable Vision-Language Models (VLMs) to understand the world more naturally, like humans. However, such existing datasets are crawled from webpages, facing challenges like low knowledge density, loose image-text relations, and poor logical coherence between images. On the other hand, the internet hosts vast instructional videos (e.g., online geometry courses) that are widely used by humans to learn foundational subjects, yet these valuable resources remain underexplored in VLM training. In this paper, we introduce a high-quality multimodal textbook corpus with richer foundational knowledge for VLM pretraining. It collects over 2.5 years of instructional videos, totaling 22,000 class hours. We first use an LLM-proposed taxonomy to systematically gather instructional videos. Then we progressively extract and refine visual (keyframes), audio (ASR), and textual knowledge (OCR) from the videos, and organize them as an image-text interleaved corpus based on temporal order. Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment. Experiments demonstrate its superb pretraining performance, particularly in knowledge- and reasoning-intensive tasks like ScienceQA and MathVista. Moreover, VLMs pre-trained on our textbook exhibit outstanding interleaved context awareness, leveraging visual and textual cues in their few-shot context for task solving. Our code is available at https://github.com/DAMO-NLP-SG/multimodal_textbook.
中文摘要:本文提出了一种从教学视频中提取的高质量多模态教材语料库,为视觉语言模型提供了更丰富的知识基础和更好的图文对齐,显著提升了其在知识密集型和推理任务中的表现。
English Summary: This paper introduces a high-quality multimodal textbook corpus derived from instructional videos, which provides richer foundational knowledge and better image-text alignment for Vision-Language Models, significantly enhancing their performance in knowledge-intensive and reasoning tasks.

Authors:Nicholas Magal, Minh Tran, Riku Arakawa, Suzanne Nie
Title: Negative to Positive Co-learning with Aggressive Modality Dropout
Abstract:
This paper documents an effective way to improve multimodal co-learning by using aggressive modality dropout. We find that aggressive modality dropout can reverse negative co-learning (NCL) into positive co-learning (PCL). It can also be used to "prep" a multimodal model for unimodal deployment, and it dramatically increases model performance during negative co-learning; in some experiments we saw a 20% gain in accuracy. We also benchmark our modality dropout technique during PCL, showing that it improves co-learning there as well, although the effect is not as substantial as during NCL. GitHub: https://github.com/nmagal/modality_drop_for_colearning
中文摘要:本文通过采用激进的模态丢弃方法,成功将负协同学习转化为正协同学习,使模型准确率最高提升20%,并为单模态部署做好了准备。
English Summary: This paper demonstrates that aggressive modality dropout effectively reverses negative co-learning into positive co-learning, significantly boosting model accuracy by up to 20% and preparing models for unimodal deployment.
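A minimal sketch of the mechanism, assuming a two-modality fusion model with tensor inputs; the dropout rate and modality names are illustrative.

import torch

def aggressive_modality_dropout(audio_feats, text_feats, p_drop=0.8):
    # With high probability, zero out one modality for the whole example
    # so the fusion model cannot over-rely on either input stream; this
    # also preps the model for unimodal deployment.
    if torch.rand(()) < p_drop:
        if torch.rand(()) < 0.5:
            audio_feats = torch.zeros_like(audio_feats)
        else:
            text_feats = torch.zeros_like(text_feats)
    return audio_feats, text_feats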

Authors:Yiwei Qin, Yixiu Liu, Pengfei Liu
Title: DIVE: Diversified Iterative Self-Improvement
Abstract:
Recent advances in large language models (LLMs) have demonstrated the effectiveness of Iterative Self-Improvement (ISI) techniques. However, continuous training on self-generated data leads to reduced output diversity, a limitation particularly critical in reasoning tasks where diverse solution paths are essential. We present DIVE (Diversified Iterative Self-Improvement), a novel framework that addresses this challenge through two key components: Sample Pool Expansion for broader solution exploration, and Data Selection for balancing diversity and quality in preference pairs. Experiments on MATH and GSM8k datasets show that DIVE achieves a 10% to 45% relative increase in output diversity metrics while maintaining performance quality compared to vanilla ISI. Our ablation studies confirm both components' significance in achieving these improvements. Code is available at https://github.com/qinyiwei/DIVE.
中文: DIVE框架通过扩展样本池和平衡多样性及质量的数据选择,改进了大型语言模型的迭代自我优化,在推理任务中显著提升了输出多样性,同时保持了性能水平。
English: The DIVE framework enhances iterative self-improvement in large language models by expanding sample pools and selecting data to balance diversity and quality, achieving significant gains in output diversity without compromising performance on reasoning tasks.

Authors:Jiajun Zhu, Peihao Wang, Ruisi Cai, Jason D. Lee, Pan Li, Zhangyang Wang
Title: Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding
Abstract:
Transformers rely on both content-based and position-based addressing mechanisms to make predictions, but existing positional encoding techniques often diminish the effectiveness of position-based addressing. Many current methods enforce rigid patterns in attention maps, limiting the ability to model long-range dependencies and adapt to diverse tasks. Additionally, most positional encodings are learned as general biases, lacking the specialization required for different instances within a dataset. To address this, we propose conTextualized equivariAnt Position Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. TAPE introduces dynamic, context-aware positional encodings, overcoming the constraints of traditional fixed patterns. We show that TAPE can provably facilitate LLM reasoning ability by emulating a broader class of algorithms. By enforcing permutation and orthogonal equivariance, TAPE ensures the stability of positional encodings during updates, improving long-context ability. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead. Extensive experiments show that TAPE achieves superior performance in language modeling, arithmetic reasoning, and long-context retrieval tasks compared to existing positional embedding techniques. Code is available at https://github.com/VITA-Group/TAPE.
中文: 提出的TAPE框架通过在各层融入序列内容,引入动态、上下文感知的位置编码,增强了Transformer的长上下文建模和推理能力,同时利用等变性确保稳定性,从而提升整体性能。
English: The proposed TAPE framework introduces dynamic, context-aware positional encodings that enhance transformer performance by incorporating sequence content across layers, improving long-context modeling and reasoning abilities while ensuring stability through equivariance properties.

Authors:Md Rakibul Hasan, Yue Yao, Md Zakir Hossain, Aneesh Krishna, Imre Rudas, Shafin Rahman, Tom Gedeon
Title: Labels Generated by Large Language Models Help Measure People's Empathy in Vitro
Abstract:
Large language models (LLMs) have revolutionised many fields, with LLM-as-a-service (LLMSaaS) offering accessible, general-purpose solutions without costly task-specific training. In contrast to the widely studied prompt engineering for directly solving tasks (in vivo), this paper explores LLMs' potential for in-vitro applications: using LLM-generated labels to improve supervised training of mainstream models. We examine two strategies - (1) noisy label correction and (2) training data augmentation - in empathy computing, an emerging task to predict psychology-based questionnaire outcomes from inputs like textual narratives. Crowdsourced datasets in this domain often suffer from noisy labels that misrepresent underlying empathy. We show that replacing or supplementing these crowdsourced labels with LLM-generated labels, developed using psychology-based scale-aware prompts, achieves statistically significant accuracy improvements. Notably, the RoBERTa pre-trained language model (PLM) trained with noise-reduced labels yields a state-of-the-art Pearson correlation coefficient of 0.648 on the public NewsEmp benchmarks. This paper further analyses evaluation metric selection and demographic biases to help guide the future development of more equitable empathy computing models. Code and LLM-generated labels are available at https://github.com/hasan-rakibul/LLMPathy.
中文: 本研究证明,利用大语言模型生成的标签进行噪声标签校正和数据增强,显著提升了共情计算中监督模型的准确性,在基准数据集上达到了最先进的性能水平。
English: This study demonstrates that using large language model-generated labels for noisy label correction and data augmentation significantly improves the accuracy of supervised models in empathy computing, achieving state-of-the-art performance on benchmark datasets.

Authors:Yomal De Mel, Kasun Wickramasinghe, Nisansa de Silva, Surangika Ranathunga
Title: Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches
Abstract:
Due to reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is eminently prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, our focus is on Romanized Sinhala transliteration. We propose two methods to address this problem: our baseline is a rule-based method, which is then compared against our second method, where we approach the transliteration problem as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based encoder-decoder solution. We observed that the Transformer-based method captures many ad hoc patterns within the Romanized scripts that the rule-based method misses. The code base associated with this paper is available on GitHub - https://github.com/kasunw22/Sinhala-Transliterator/
中文: 本研究针对僧伽罗语罗马化转写的普遍问题,比较了基于规则的方法和基于Transformer的序列到序列方法,发现后者能更有效地捕捉转写中的临时模式。
English: The study addresses the prevalence of Romanized Sinhala transliteration by comparing a rule-based method with a Transformer-based sequence-to-sequence approach, finding the latter more effective at capturing ad-hoc patterns in the scripts.

Authors:Ke Yang, Volodymyr Kindratenko, ChengXiang Zhai
Title: TinyHelen's First Curriculum: Training and Evaluating Tiny Language Models in a Simpler Language Environment
Abstract:
Training language models (LMs) and their application agents is increasingly costly due to large datasets and models, making test failures difficult to bear. Simplified language environments serve as primordial training and testing grounds, retaining essential commonsense and communication skills but in a more digestible form, potentially enhancing the learning efficiency of LMs, and thus reducing the required model size and data volume for effective training and evaluation. In these simplified language environments, workable strategies for small models, datasets, and agents may be adaptable to larger models, datasets, and agents in complex language environments. To create such environments, we focus on two aspects: i) minimizing language dataset noise and complexity, and ii) preserving the essential text distribution characteristics. Unlike previous methods, we propose a pipeline to refine text data by eliminating noise, minimizing vocabulary, and maintaining genre-specific patterns (e.g., for books, conversation, code, etc.). Implementing this pipeline with large LMs, we have created a leaner suite of LM training and evaluation datasets: 71M Leaner-Pretrain, 7M Leaner-Instruct, Leaner-Glue for assessing linguistic proficiency, and Leaner-Eval for testing instruction-following ability. Our experiments show that leaner pre-training boosts LM learning efficiency. Tiny LMs trained on these datasets outperform those trained on original datasets in instruction-following across different language granularity levels. Moreover, the Leaner-Pretrain dataset's alignment with conventional large LM training sets enables resource-optimized analysis of how learning objectives, model architectures, and training techniques impact performance on language modeling and downstream tasks. Our code and datasets are available at https://github.com/EmpathYang/TinyHelen.git.
中文:简化语言环境通过降低数据集复杂性同时保留核心语言特征,提升了语言模型的训练效率,使得小型模型能超越传统训练方法的表现,并为资源优化的性能分析提供了可能。
English: Simplified language environments enhance LM training efficiency by reducing dataset complexity while preserving essential linguistic features, enabling smaller models to outperform traditional training methods and facilitating resource-optimized performance analysis.

Authors:Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao
Title: Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Abstract:
Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion model presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most of encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP space. Next, the patch-wise visual features of the input image are selectively fused with the textual features of the salient visual concepts, leading to a mixed-up feature map with less defective content. Finally, a visual-semantic encoder is exploited to refine the derived feature map, which is further incorporated into the sentence decoder for caption generation. Additionally, to facilitate the model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts. Extensive experiments on MSCOCO and Flickr30k datasets demonstrate the superiority of our PCM-Net compared with state-of-the-art VLMs-based approaches. It is noteworthy that our PCM-Net ranks first in both in-domain and cross-domain zero-shot image captioning. The synthetic dataset SynthImgCap and code are available at https://jianjieluo.github.io/SynthImgCap.
中文摘要:提出的PCM-Net通过逐块跨模态特征混合机制,在零样本图像描述任务中自适应融合文本特征以修正合成图像的语义偏差,在基准数据集上实现了最优性能。
English Summary: The proposed PCM-Net introduces a patch-wise cross-modal feature mix-up mechanism to address semantic misalignment in zero-shot image captioning by adaptively refining synthetic image features with textual concepts, achieving state-of-the-art performance on benchmark datasets.
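A schematic PyTorch sketch of the patch-wise mix-up idea: patches whose best CLIP-space match to a salient concept is weak lean more on that concept's text feature. The weighting rule here is an illustrative assumption, not PCM-Net's exact design.

import torch

def patchwise_mixup(patch_feats, concept_feats, sim, alpha=0.5):
    # patch_feats: (P, d) visual patch features; concept_feats: (C, d)
    # text features of detected salient concepts; sim: (P, C) CLIP-space
    # similarities. Low similarity => the patch is likely defective, so
    # its feature is blended more heavily with the matching concept's.
    best_sim, best_idx = sim.max(dim=1)
    w = alpha * (1.0 - best_sim).clamp(0.0, 1.0).unsqueeze(1)
    return (1.0 - w) * patch_feats + w * concept_feats[best_idx]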

Authors:Wanlong Liu, Junying Chen, Ke Ji, Li Zhou, Wenyu Chen, Benyou Wang
Title: RAG-Instruct: Boosting LLMs with Diverse Retrieval-Augmented Instructions
Abstract:
Retrieval-Augmented Generation (RAG) has emerged as a key paradigm for enhancing large language models (LLMs) by incorporating external knowledge. However, current RAG methods face two limitations: (1) they only cover limited RAG scenarios. (2) They suffer from limited task diversity due to the lack of a general RAG dataset. To address these limitations, we propose RAG-Instruct, a general method for synthesizing diverse and high-quality RAG instruction data based on any source corpus. Our approach leverages (1) five RAG paradigms, which encompass diverse query-document relationships, and (2) instruction simulation, which enhances instruction diversity and quality by utilizing the strengths of existing instruction datasets. Using this method, we construct a 40K instruction dataset from Wikipedia, comprehensively covering diverse RAG scenarios and tasks. Experiments demonstrate that RAG-Instruct effectively enhances LLMs' RAG capabilities, achieving strong zero-shot performance and significantly outperforming various RAG baselines across a diverse set of tasks. RAG-Instruct is publicly available at https://github.com/FreedomIntelligence/RAG-Instruct.
Chinese: RAG-Instruct是一种基于任意语料库、通过五种RAG范式和指令模拟技术合成多样化高质量指令数据的新方法,能显著提升大语言模型在多种任务中的检索增强生成性能。
English: RAG-Instruct is a novel method that synthesizes diverse, high-quality instruction data from any corpus using five RAG paradigms and instruction simulation, effectively enhancing LLMs' retrieval-augmented generation capabilities across various tasks.
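To make the synthesis recipe concrete, here is a schematic of one generation step, assuming a call_llm helper and placeholder paradigm labels; the paper's exact taxonomy and prompt wording may differ:

    import random

    # Placeholder names for the five query-document relationship paradigms;
    # they stand in for the paper's taxonomy rather than reproduce it.
    PARADIGMS = ["single-doc", "multi-doc", "implicit", "comparative", "distractor"]

    def synthesize_example(corpus_docs, exemplar_instructions, call_llm):
        paradigm = random.choice(PARADIGMS)
        docs = random.sample(corpus_docs, k=min(3, len(corpus_docs)))
        exemplar = random.choice(exemplar_instructions)   # instruction simulation
        prompt = (
            f"Query-document relationship: {paradigm}\n"
            "Documents:\n" + "\n---\n".join(docs) + "\n"
            f"Style exemplar: {exemplar}\n"
            "Write one instruction and its answer grounded in the documents."
        )
        return call_llm(prompt)

Sampling a paradigm and a style exemplar per step is what drives the diversity the abstract emphasizes: the same corpus yields many distinct query-document relationships and instruction styles.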

Authors:Mahir Labib Dihan, Md Tanvir Hassan, Md Tanvir Parvez, Md Hasebul Hasan, Md Almash Alam, Muhammad Aamir Cheema, Mohammed Eunus Ali, Md Rizwan Parvez
Title: MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models
Abstract:
Recent advances in foundation models have improved autonomous tool use and reasoning, but their map-based reasoning capabilities remain underexplored. To address this, we introduce MapEval, a benchmark designed to assess foundation models across three distinct tasks - textual, API-based, and visual reasoning - through 700 multiple-choice questions spanning 180 cities and 54 countries, covering spatial relationships, navigation, travel planning, and real-world map interactions. Unlike prior benchmarks that focus on simple location queries, MapEval requires models to handle long-context reasoning, API interactions, and visual map analysis, making it the most comprehensive evaluation framework for geospatial AI to date. In an evaluation of 30 foundation models, including Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro, none surpasses 67% accuracy; open-source models perform significantly worse, and all models lag more than 20% behind human performance. These results expose critical gaps in spatial inference, as models struggle with distances, directions, route planning, and place-specific reasoning, highlighting the need for better geospatial AI to bridge the gap between foundation models and real-world navigation. All resources are available at: https://mapeval.github.io/.
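A bare-bones scoring loop over multiple-choice items of the kind the benchmark describes; the question schema and the answer-extraction rule below are assumptions for illustration, not the released evaluation harness:

    def evaluate_mcq(answer_fn, questions):
        # questions: iterable of dicts with "prompt", "options", "answer" keys
        # (a hypothetical schema); answer_fn maps a prompt string to model text.
        correct = 0
        for q in questions:
            options = "\n".join(
                f"{chr(65 + i)}. {opt}" for i, opt in enumerate(q["options"]))
            reply = answer_fn(f"{q['prompt']}\n{options}\nAnswer with one letter.")
            correct += reply.strip().upper().startswith(q["answer"])
        return correct / len(questions)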

Authors:Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, Jiliang Tang
Title: Retrieval-Augmented Generation with Graphs (GraphRAG)
Abstract:
Retrieval-augmented generation (RAG) is a powerful technique that enhances downstream task execution by retrieving additional information, such as knowledge, skills, and tools, from external sources. Graphs, with their intrinsic "nodes connected by edges" structure, encode massive amounts of heterogeneous, relational information, making them a rich resource for RAG in numerous real-world applications. As a result, we have recently witnessed increasing attention on equipping RAG with graphs, i.e., GraphRAG. However, unlike conventional RAG, where the retriever, generator, and external data sources can be uniformly designed in the neural-embedding space, the uniqueness of graph-structured data, such as diversely formatted, domain-specific relational knowledge, poses significant challenges when designing GraphRAG for different domains. Given its broad applicability, the associated design challenges, and the recent surge in GraphRAG, a systematic and up-to-date survey of its key concepts and techniques is urgently needed. Following this motivation, we present a comprehensive and up-to-date survey of GraphRAG. Our survey first proposes a holistic GraphRAG framework by defining its key components: query processor, retriever, organizer, generator, and data source. Furthermore, recognizing that graphs in different domains exhibit distinct relational patterns and require dedicated designs, we review GraphRAG techniques uniquely tailored to each domain. Finally, we discuss research challenges and brainstorm directions to inspire cross-disciplinary opportunities. Our survey repository is publicly maintained at https://github.com/Graph-RAG/GraphRAG/.
Chinese: GraphRAG通过利用图结构数据增强检索生成能力,针对不同领域的独特挑战提出了系统性框架和定制化技术解决方案。
English: GraphRAG enhances retrieval-augmented generation by leveraging graph-structured data, addressing unique challenges across domains through a systematic framework and tailored techniques.
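The component framework the survey proposes maps naturally onto a small interface; the sketch below wires placeholder callables together under that framing and makes no claim about any concrete GraphRAG system:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class GraphRAGPipeline:
        # One callable per framework component; each is application-specific.
        process_query: Callable[[str], str]       # query processor
        retrieve: Callable[[str], List[str]]      # retriever over the graph data source
        organize: Callable[[List[str]], str]      # organizer: prune/rerank/linearize
        generate: Callable[[str, str], str]       # generator, e.g. an LLM call

        def answer(self, query: str) -> str:
            q = self.process_query(query)
            facts = self.retrieve(q)              # e.g. facts from a retrieved subgraph
            context = self.organize(facts)
            return self.generate(q, context)

Separating the organizer from the retriever is the point of the framework: graph retrieval typically returns structured fragments that must be pruned and linearized before a text-based generator can use them.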

Authors:James P. Beno
Title: ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis
Abstract:
Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLMs) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided ELECTRA's output to GPT as input: the predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.50 macro F1 vs. 79.14 for ELECTRA Base FT and 79.41 for GPT-4o-mini) and yielded the lowest cost/performance ratio ($0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.70) at much lower cost ($0.38 vs. $1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and that a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% less cost. Both are affordable options for projects with limited resources.
Chinese: 研究表明,将ELECTRA的预测结果与GPT-4o-mini结合能显著提升情感分析性能且成本效益突出,而经过微调的GPT模型中,GPT-4o-mini以76%的成本降幅实现了与GPT-4o近乎相当的性能。
English: This study demonstrates that combining ELECTRA's predictions with GPT-4o-mini significantly enhances sentiment analysis performance cost-effectively, while fine-tuned GPT models achieve top results with GPT-4o-mini offering nearly equivalent performance at substantially lower cost.
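The prompt-augmentation step can be sketched as follows, assuming a fine-tuned ELECTRA sentiment checkpoint is available locally; the model path and prompt wording are placeholders, not the paper's exact setup:

    from transformers import pipeline

    # Placeholder path for an ELECTRA model fine-tuned on SST + DynaSent.
    clf = pipeline("text-classification", model="path/to/electra-ft-sentiment",
                   top_k=None)

    def build_prompt(review: str) -> str:
        scores = clf([review])[0]                 # all labels with probabilities
        best = max(scores, key=lambda s: s["score"])
        probs = ", ".join(f"{s['label']}: {s['score']:.2f}" for s in scores)
        return (
            f"Review: {review}\n"
            f"A fine-tuned encoder predicts {best['label']} (probabilities: {probs}).\n"
            "Classify the sentiment as positive, neutral, or negative."
        )

The returned string would then be sent to GPT-4o-mini; per the abstract, this kind of prediction sharing helps the zero-shot LLM but hurts once the GPT model is itself fine-tuned.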